I have a data frame like this:
x=data.frame(type = c('a','b','c','a','b','a','b','c'),
value=c(5,2,3,2,10,6,7,8))
Every item has attributes a, b, and c, but some items may be missing attributes, e.g. have only a and b.
The desired output is
y=data.frame(item=c(1,2,3), a=c(5,2,6), b=c(2,10,7), c=c(3,NA,8))
How can I transform x to y? Thanks
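We can use dcast from data.table. Note that rowid(type) numbers the occurrences of each type separately, so when an item is missing a type, later values of that type shift up (here the second c, 8, lands in item 2 rather than item 3).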
library(data.table)
out <- dcast(setDT(x), rowid(type) ~ type, value.var = 'value')
setnames(out, 'type', 'item')
out
# item a b c
#1: 1 5 2 3
#2: 2 2 10 8
#3: 3 6 7 NA
Create a grouping vector g, assuming each occurrence of "a" starts a new group, then use tapply to create a table tab and coerce that to a data frame. No packages are used.
g <- cumsum(x$type == "a")
tab <- with(x, tapply(value, list(g, type), c))
as.data.frame(tab)
giving:
a b c
1 5 2 3
2 2 10 NA
3 6 7 8
An alternate definition of the grouping vector, which is slightly more complex but would be needed if some groups are missing "a", is the following. It assumes that x lists the type values in order of their levels within each group, so that if a level is less than the prior level it must be the start of a new group. The call to factor() makes this work whether type is stored as a factor or as character.
g <- cumsum(c(-1, diff(as.numeric(factor(x$type)))) < 0)
Note that ultimately there must be some restriction on missingness; otherwise, the problem is ambiguous. For example, if one group can have b and c missing and the next group can have a missing, then it is not determinable whether the b and c that follow actually form a second group or are part of the first group.
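As a small illustration of the alternate grouping vector, the sketch below uses made-up data (the vectors type and value and the values 1:7 are assumptions for this example only) in which the second group has no a. The type column is wrapped in factor() so that the level comparison also works when type is stored as character rather than as a factor.

```r
# Hypothetical data: the second group is missing "a"
type  <- c("a", "b", "c", "b", "c", "a", "b")
value <- 1:7

# A drop in level number marks the start of a new group
lev <- as.numeric(factor(type))
g <- cumsum(c(-1, diff(lev)) < 0)
g

# Same tapply step as above; absent combinations become NA
tab <- tapply(value, list(g, type), c)
as.data.frame(tab)
```

Here g comes out as 1 1 1 2 2 3 3, so the b and c that follow the first group are placed in their own row with NA in column a.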
I have a number of variables to group on, some are "normal" grouping variables (i.e. numeric/character strings) and some of which are lists of strings. What I want to do is determine a group based on matches between the "normal" variables & the presence of a match between any string within these lists.
Example data:
library(dplyr)  # for %>% and arrange()
dat <- data.frame(cod=c(1:5,1:5,3))
dat$lst <- c(list(c("a","a"),c("b"),c("x","x"),c("r","r"),c("t","t"),c("a"),c("e"),c("f","x"),c("e","q"),c("t"),c("f","f")))
dat <- dat %>% arrange(cod)
#Data
cod lst
1 a, a
1 a
2 b
2 e
3 x, x
3 f, x
3 f, f
4 r, r
4 e, q
5 t, t
5 t
So where lst contains a string that also appears in another row with the same cod, I want to put those rows in one group, like this:
# Desired output
cod lst grp
1 a, a 1
1 a 1
2 b 2
2 e 3
3 x, x 4
3 f, x 4
3 f, f 4
4 r, r 5
4 e, q 6
5 t, t 7
5 t 7
In grp 4, all three observations should be linked as there are common list items shared between all three (i.e. one observation has a lst value of c(f,x) which links c(f,f) and c(x,x) within the cod group 3)
I tried to just create a logical TRUE/FALSE column that would show if some cods should be grouped via:
dat %>%
group_by(cod) %>%
mutate(grp = ifelse((lst %in% lst),TRUE,FALSE))
As well as via:
for (i in 1:dim(dat)[1]) {
if (all(any(dat$cod[i] %in% dat$cod) & any(dat$lst[[i]] %in% unlist(dat$lst[-i])))) {
dat$grp[i] <- TRUE
} else {
dat$grp[i] <- FALSE
}
}
But nothing has been able to isolate unique groups so far. Any help greatly appreciated!!!! Thanks!
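One way to read this requirement is as a connected-components problem: two rows are linked when they share a cod and at least one string in lst, and a group is everything reachable through such links. The following is only a minimal base R sketch under that reading, not an answer from the original thread; the helper vector comp is a name introduced here, and the pairwise loop is quadratic in the number of rows, so it is meant for small data.

```r
# Example data, already sorted by cod as in the question
dat <- data.frame(cod = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5))
dat$lst <- list(c("a", "a"), "a", "b", "e", c("x", "x"), c("f", "x"),
                c("f", "f"), c("r", "r"), c("e", "q"), c("t", "t"), "t")

n <- nrow(dat)
comp <- seq_len(n)  # every row starts in its own component

for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    linked <- dat$cod[i] == dat$cod[j] &&
      length(intersect(dat$lst[[i]], dat$lst[[j]])) > 0
    if (linked) {
      # merge the whole component of row j into the component of row i
      comp[comp == comp[j]] <- comp[i]
    }
  }
}

# Renumber components as 1, 2, 3, ... in order of first appearance
dat$grp <- match(comp, unique(comp))
dat$grp
```

Because a merge relabels the whole component, the c("x","x") and c("f","f") rows end up together through the c("f","x") row even though they share no string directly.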
I just started working with R for my master's thesis, and up to now all my calculations have worked out because I read a lot of questions and answers here (it's a lot of trial and error, but that's OK).
Now I need to write some more sophisticated code and I can't find a way to do it.
This is the situation: I have multiple sub-data-sets with a lot of entries, but they are all structured in the same way. In one of them (50000 entries) I want to change only one value in every row. The new value should be the existing entry plus a few values from another sub-data-set (140000 entries) where the 'ID' variable is the same.
As this is the third day I've been trying to solve this, I have already found and tested for and apply, but both run for hours (canceled after three hours).
Here is an example of one of my attempts (with for):
for (i in 1:50000) {
Entry_ID <- Sub02[i,4]
SUM_Entries <- sum(Sub03$Source==Entry_ID)
Entries_w_ID <- subset(Sub03, grepl(Entry_ID, Sub03$Source)) # The Entry_ID/Source is a character
Value1 <- as.numeric(Entries_w_ID$VAL1)
SUM_Value1 <- sum(Value1)
Value2 <- as.numeric(Entries_w_ID$VAL2)
SUM_Value2 <- sum(Value2)
OLD_Val1 <- Sub02[i,13]
OLD_Val <- as.numeric(OLD_Val1)
NEW_Val <- SUM_Entries + SUM_Value1 + SUM_Value2 + OLD_Val
Sub02[i,13] <- NEW_Val
}
I know this might be silly code, but that's the way I tried it as a beginner. I would be very grateful if someone could help me out with this so I can get on with my thesis.
Thank you!
Here's an example of my data-structure:
Text VAL0 Source ID VAL1 VAL2 VAL3 VAL4 VAL5 VAL6 VAL7 VAL8 VAL9
XXX 12 456335667806925_1075080942599058 10153901516433434_10153902087098434 4 1 0 0 4 9 4 6 8
ABC 8 456335667806925_1057045047735981 10153677787178434_10153677793613434 6 7 1 1 5 3 6 8 11
DEF 8 456747267806925_2357045047735981 45653677787178434_94153677793613434 5 8 2 1 5 4 1 1 9
The output I expect is an updated value 'VAL9' in every row.
From what I understood so far, you need 2 things:
sum up some values in one dataset
add them to another dataset, using an ID variable
Besides what @yoland already contributed, I would suggest breaking it down into two separate tasks. Consider these two datasets:
a = data.frame(x = 1:2, id = letters[1:2], stringsAsFactors = FALSE)
a
# x id
# 1 1 a
# 2 2 b
b = data.frame(values = as.character(1:4), otherid = letters[1:2],
stringsAsFactors = FALSE)
sapply(b, class)
# values otherid
# "character" "character"
Values is character now, we need to convert it to numeric:
b$values = as.numeric(b$values)
sapply(b, class)
# values otherid
# "numeric" "character"
Then sum up the values in b (grouped by otherid):
library(dplyr)
b = group_by(b, otherid)
b = summarise(b, sum_values = sum(values))
b
# otherid sum_values
# <chr> <dbl>
# 1 a 4
# 2 b 6
Then join it with a - note that identifiers are specified in c():
ab = left_join(a, b, by = c("id" = "otherid"))
ab
# x id sum_values
# 1 1 a 4
# 2 2 b 6
We can then add the result of the sum from b to the variable x in a:
ab$total = ab$x + ab$sum_values
ab
# x id sum_values total
# 1 1 a 4 5
# 2 2 b 6 8
(Updated.)
From what I understand, you want to create a new variable that uses information from two different data sets indexed by the same ID. The easiest way to do this is probably to join the data sets (if you need to save memory, join only the columns you need). I found dplyr's join functions very handy for these cases (explained neatly here). Once the data sets are joined into one, it should be easy to create the new columns you need, e.g.: df$new <- df$old1 + df$old2
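To connect this to the column names in the question, here is a minimal base R sketch of the same aggregate-then-look-up idea. The tiny Sub02 and Sub03 tables and the keys k1, k2, k3 are made up for illustration, and the sketch assumes that Source must match ID exactly (the loop in the question mixes == and grepl, so this is an assumption). The helper columns n and add are names introduced here.

```r
# Hypothetical miniature versions of the two sub-data-sets
Sub02 <- data.frame(ID = c("k1", "k2", "k3"), VAL9 = c(8, 11, 9),
                    stringsAsFactors = FALSE)
Sub03 <- data.frame(Source = c("k1", "k1", "k2"),
                    VAL1 = c("4", "6", "5"), VAL2 = c("1", "7", "8"),
                    stringsAsFactors = FALSE)

# Convert the character columns once, outside of any loop
Sub03$VAL1 <- as.numeric(Sub03$VAL1)
Sub03$VAL2 <- as.numeric(Sub03$VAL2)
Sub03$n <- 1  # summing this column counts the matching entries

# One pass over Sub03: count and sums per Source
agg <- aggregate(cbind(n, VAL1, VAL2) ~ Source, data = Sub03, FUN = sum)
agg$add <- agg$n + agg$VAL1 + agg$VAL2

# Look up each ID; IDs with no matching Source contribute 0
idx <- match(Sub02$ID, agg$Source)
Sub02$VAL9 <- Sub02$VAL9 + ifelse(is.na(idx), 0, agg$add[idx])
Sub02
```

The point of the design is that Sub03 is scanned once instead of once per row of Sub02, which is where the loop spends its time.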
Following this worked example:
case <- c('a','a','a','b','b','c','c','c','c','d','d','e','e')
ID <- c('aa','bb','zz','aa','cc','ee','ff','gg','kk','aa','kk','cc','dd')
score <- c(1,1,3,4,2,3,2,2,1,1,3,3,2)
df1 <- data.frame(case, ID, score)
identifier <- c('aa','bb','ff')
For each unique case (that is, a, b, c, d, ...), I want to scan the ID column and count how often it contains a value from identifier.
So we look at the 3 rows with case==a and count how many times ID matches identifier (in this case, 2 times).
We then look at the 2 rows with case==b and count how many times ID matches identifier (in this case, 1 time).
We do this for every unique case.
I have used the following command, but it works on the whole sample rather than separately per unique case:
df1$ID %in% identifier
What I want as an end result is a table with one column holding each unique case and a second column holding the number of times ID matched identifier.
So I want to loop over or automate the process and return output similar to:
data.frame(c('a','b','c','d','e'), c(2,1,1,1,0))
You can use tapply():
tapply(df1$ID, df1$case, FUN = function(id) sum(id %in% identifier))
a b c d e
2 1 1 1 0
but as @Jaap pointed out, you can use aggregate() to get a data.frame:
aggregate(ID ~ case, data = df1, FUN = function(id) sum(id %in% identifier))
case ID
1 a 2
2 b 1
3 c 1
4 d 1
5 e 0
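A closely related sketch stores the membership test as its own column first, so that aggregate() only has to sum a logical vector. The column name hit is an assumption introduced for this example.

```r
case <- c('a','a','a','b','b','c','c','c','c','d','d','e','e')
ID <- c('aa','bb','zz','aa','cc','ee','ff','gg','kk','aa','kk','cc','dd')
df1 <- data.frame(case, ID)
identifier <- c('aa','bb','ff')

# TRUE where the row's ID is one of the identifiers
df1$hit <- df1$ID %in% identifier

# Summing a logical vector counts its TRUE values
res <- aggregate(hit ~ case, data = df1, FUN = sum)
res
```

Keeping the flag as a column can be handy if the same match is needed again for other summaries.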
And if you want to group by more than one variable, you can do:
df <- aggregate(ID ~ case+(score>1), data = df1, FUN = function(id) sum(id %in% identifier))
df[df$`score > 1`,c(1,3)]
case ID
4 a 0
5 b 1
6 c 1
7 d 0
8 e 0
I am working in R with a data frame d:
ID <- c("A","A","A","B","B")
eventcounter <- c(1,2,3,1,2)
numberofevents <- c(3,3,3,2,2)
d <- data.frame(ID, eventcounter, numberofevents)
> d
ID eventcounter numberofevents
1 A 1 3
2 A 2 3
3 A 3 3
4 B 1 2
5 B 2 2
where numberofevents is the highest value in the eventcounter for each ID.
Currently, I am trying to create an additional vector z <- c(6,6,6,3,3).
If numberofevents == 3, it is supposed to calculate sum(1:3), which is equal to 3 + 2 + 1 = 6.
If numberofevents == 2, it is supposed to calculate sum(1:2), which is equal to 2 + 1 = 3.
Working with a large set of data, I thought it might be convenient to create this additional vector
by using the sum function in R d$z<-sum(1:d$numberofevents), i.e.
sum(1:3) # for the rows 1-3
and
sum(1:2) # for the rows 4-5.
However, I always get this warning:
Numerical expression has x elements: only the first is used.
You can try ave
d$z <- with(d, ave(eventcounter, ID, FUN=sum))
Or using data.table
library(data.table)
setDT(d)[,z:=sum(eventcounter), ID][]
Try using one of the apply functions in R, such as sapply or lapply.
sapply(numberofevents, function(x) sum(1:x))
It works for me.
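The warning in the question arises because the colon operator builds a single sequence and uses only the first element of d$numberofevents. Since sum(1:n) equals n(n+1)/2, a fully vectorized sketch that needs no grouping or looping could look like this (it assumes numberofevents holds positive whole numbers):

```r
ID <- c("A", "A", "A", "B", "B")
eventcounter <- c(1, 2, 3, 1, 2)
numberofevents <- c(3, 3, 3, 2, 2)
d <- data.frame(ID, eventcounter, numberofevents)

# Closed form of 1 + 2 + ... + n, applied element-wise
d$z <- d$numberofevents * (d$numberofevents + 1) / 2
d$z
```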
I have data in the form:
id source
1 m
1 p
1 l1
1 l1
2 t
2 q
3 p
3 l1
3 n
3 l1
Now, for every id, I want to find each l1 in source and extract the observation immediately before it.
For example, for id 1 the 3rd source is l1, and the observation before it is p.
So my data should look like this:
id source
1 p
3 p
3 n
How can I create this in R?
A data.table solution
library(data.table)
dd <- data.table(df)
dd[, source[match('l1', source)-1L],by = id]
There might be a more direct method, but try this:
#get your data
test <- read.table(text="id source
1 m
1 p
1 l1
1 l1
2 t
2 q
3 p
3 l1
3 n
3 l1",header=TRUE)
# do some picking of the cases
result <- do.call(rbind,by(test,test$id,function(x) x[which(x$source=="l1")-1,]))
result <- result[result$source!="l1",]
Which gives:
> result
id source
2 1 p
7 3 p
9 3 n
Here is another data.table solution. I wasn't able to get what seemed like a correct answer with the earlier version from @mnel.
library(data.table)
## Create the test data table:
dt <- data.table(id=c(1,1,1,1,2,2,3,3,3,3),
source1=c("m","p","l1","l1","t","q","p","l1","n","l1"))
dt[,list(id, source1, source0=c(NA,source1[seq_len(.N-1L)]))][source1=="l1"]
## id source1 source0
## 1: 1 l1 p
## 2: 1 l1 l1
## 3: 3 l1 p
## 4: 3 l1 n
This adds a column source0 to the data table that holds the value from the previous row (or NA for the first row). .N is the number of rows, and I'm using seq_len(.N-1L) to index every row except the last, which shifts the column down by one. Then it subsets the result to the rows where the original source1 has a value of "l1".
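Note that the lagged column above is computed over the whole table rather than per id, and the output still contains the row whose previous value is itself l1. As a hedged illustration of the same lag idea computed within each id, here is a base R sketch using ave(); the column name prev and the object names keep and result are assumptions introduced for this example.

```r
DF <- data.frame(id = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
                 source = c("m", "p", "l1", "l1", "t", "q",
                            "p", "l1", "n", "l1"),
                 stringsAsFactors = FALSE)

# Previous source within each id (NA for the first row of an id)
DF$prev <- ave(DF$source, DF$id, FUN = function(s) c(NA, head(s, -1)))

# Rows that are l1 and whose predecessor is a non-l1 observation;
# which() drops the NA cases where l1 is the first row of an id
keep <- which(DF$source == "l1" & DF$prev != "l1")
result <- data.frame(id = DF$id[keep], source = DF$prev[keep])
result
```

For the sample data this leaves the three id/source pairs asked for in the question: (1, p), (3, p) and (3, n).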
Here is a vectorized solution using only simple functions from the base of R.
If DF is the input data frame then sel is a logical vector whose TRUE components select out the required rows. The three terms connected by & signs select those rows:
for which the following row's source column equals "l1" and
whose source column is not l1 and
are such that the following row is not the first with that id
The length of sel is one less than the number of rows in DF so we use which to avoid recycling of sel.
is.l1 <- DF$source == "l1"
sel <- is.l1[-1] & !is.l1[-nrow(DF)] & duplicated(DF$id)[-1]
DF[which(sel),]
The result of the last line is:
id source
2 1 p
7 3 p
9 3 n