Related
I am trying to extract score information for each ID and for each itemID. Here how my sample dataset looks like.
df <- data.frame(Text_1 = c("Scoring", "1 = Incorrect","Text1","Text2","Text3","Text4", "Demo 1: Color Naming","Amarillo","Azul","Verde","Azul",
"Demo 1: Errors","Item 1: Color naming","Amarillo","Azul","Verde","Azul",
"Item 1: Time in seconds","Item 1: Errors",
"Item 2: Shape Naming","Cuadrado/Cuadro","Cuadrado/Cuadro","Círculo","Estrella","Círculo","Triángulo",
"Item 2: Time in seconds","Item 2: Errors"),
School.2 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, NA,NA,NA,NA,NA,
0,"1 = Incorrect responses",0,1,NA,NA,NA,0,"1 = Incorrect responses",0,NA,NA,1,1,0,NA,0),
X_Elementary_School..3 = c("Bill:","X District","10/7/21","K","123-2222-2:",NA, NA,NA,NA,NA,NA,
NA,"Child response",NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA,NA,NA),
School.4 = c("Teacher:","DC Name:","Date (mm/dd/yyyy):","Child Grade:","Student Study ID:",NA, 0,NA,1,NA,NA,0,"1 = Incorrect responses",0,1,NA,NA,120,0,"1 = Incorrect responses",NA,1,0,1,NA,1,110,0),
Y_Elementary_School..2 = c("John:","X District","11/7/21","K","112-1111-3:",NA, NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA,"Child response",NA,NA,NA,NA,NA,NA, NA,NA))
> df
Text_1 School.2 X_Elementary_School..3 School.4 Y_Elementary_School..2
1 Scoring Teacher: Bill: Teacher: John:
2 1 = Incorrect DC Name: X District DC Name: X District
3 Text1 Date (mm/dd/yyyy): 10/7/21 Date (mm/dd/yyyy): 11/7/21
4 Text2 Child Grade: K Child Grade: K
5 Text3 Student Study ID: 123-2222-2: Student Study ID: 112-1111-3:
6 Text4 <NA> <NA> <NA> <NA>
7 Demo 1: Color Naming <NA> <NA> 0 <NA>
8 Amarillo <NA> <NA> <NA> <NA>
9 Azul <NA> <NA> 1 <NA>
10 Verde <NA> <NA> <NA> <NA>
11 Azul <NA> <NA> <NA> <NA>
12 Demo 1: Errors 0 <NA> 0 <NA>
13 Item 1: Color naming 1 = Incorrect responses Child response 1 = Incorrect responses Child response
14 Amarillo 0 <NA> 0 <NA>
15 Azul 1 <NA> 1 <NA>
16 Verde <NA> <NA> <NA> <NA>
17 Azul <NA> <NA> <NA> <NA>
18 Item 1: Time in seconds <NA> <NA> 120 <NA>
19 Item 1: Errors 0 <NA> 0 <NA>
20 Item 2: Shape Naming 1 = Incorrect responses Child response 1 = Incorrect responses Child response
21 Cuadrado/Cuadro 0 <NA> <NA> <NA>
22 Cuadrado/Cuadro <NA> <NA> 1 <NA>
23 Círculo <NA> <NA> 0 <NA>
24 Estrella 1 <NA> 1 <NA>
25 Círculo 1 <NA> <NA> <NA>
26 Triángulo 0 <NA> 1 <NA>
27 Item 2: Time in seconds <NA> <NA> 110 <NA>
28 Item 2: Errors 0 <NA> 0 <NA>
This sample dataset is limited only for two schools, two teachers and two students.
In this step, I need to extract student responses for each item.
Wherever the first column has Item , I need to grab from there. I especially need to index the rows and columns columns rather than giving the exact row columns number since this will be for multiple datafiles and each files has different information. No need to grab the ..:Error part.
################################################################################
# ## 2-extract the score information here
# ## 1-grab item information from where "Item 1:.." starts
Here, rather than using row number, how to automate this part.
score<-df[c(7:11,13:17,20:26),c(seq(2,dim(df)[2],2))] # need to automate row and columns index here
score<-as.data.frame(t(score))
rownames(score)<-seq(1,nrow(score),1)
colnames(score)<-paste0('i',seq(1,ncol(score),1)) # assign col names for items
score<-apply(score,2,as.numeric) # only keep numeric columns
score<-as.data.frame(score)
score$total<-rowSums(score,na.rm=T); score # create a total score
> score
i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 total
1 NA NA NA NA NA NA 0 1 NA NA NA 0 NA NA 1 1 0 3
2 0 NA 1 NA NA NA 0 1 NA NA NA NA 1 0 1 NA 1 5
Additionally, I need to add ID which I could not achieve here.
My desired output would be:
> score
ID i1 i2 i3 i4 i5 i6 i7 i8 i9 i10 i11 i12 i13 i14 i15 i16 i17 total
1 123-2222-2 NA NA NA NA NA NA 0 1 NA NA NA 0 NA NA 1 1 0 3
2 112-1111-3 0 NA 1 NA NA NA 0 1 NA NA NA NA 1 0 1 NA 1 5
I have a dataset basically looks like that, giving which campaigns are active for each household with given start and end dates of respective campaigns:
campaign_id household_id campaign_type start_date end_date
1 26 1 Type B 2016-12-28 2017-02-19
2 8 1 Type A 2017-05-08 2017-06-25
3 12 1 Type B 2017-07-12 2017-08-13
4 13 1 Type A 2017-08-08 2017-09-24
5 18 1 Type A 2017-10-30 2017-12-24
6 20 1 Type C 2017-11-27 2018-02-05
7 22 1 Type B 2017-12-06 2018-01-07
8 23 1 Type B 2017-12-28 2018-02-04
And I create a new dataframe with given structure, which will show which campaigns are active for given household in a given time (having all the campaign numbers as columns, i have omitted the rest while putting here):
household_id date campaign1 campaign2 campaign3 campaign4
1 1 2016-11-14 NA NA NA NA
2 1 2016-12-06 NA NA NA NA
3 1 2016-12-28 NA NA NA NA
4 1 2017-02-08 NA NA NA NA
5 1 2017-03-03 NA NA NA NA
6 1 2017-03-08 NA NA NA NA
7 1 2017-03-13 NA NA NA NA
8 1 2017-03-29 NA NA NA NA
9 1 2017-04-03 NA NA NA NA
10 1 2017-04-19 NA NA NA NA
What I want to do is assigning the active promotions in the given dates as rows in the second dataframe. For example if household_id 1 is having campaign 2 running in 2016-11-14 but no other campaigns, then it will look like this:
household_id date campaign1 campaign2 campaign3 campaign4
1 1 2016-11-14 0 1 0 0
How can i manage this construction, should I use for loops in the initial dataframe and assign to second one in each loop, or there is a better and faster way? Thanks in advance.
In my research I have a dataset of cancer patients with some clinical information like cancer stage and treatment etc. Each patient has one row in a table with this clinical information. In addition, each patient has, at one or several timepoints during the treatment, taken blood samples, depending on how long the patient has been followed at the clinic. The first sample is from the first visit and the second sample is from the second visit at the clinic, and so on.
In the table, there is a variable (ie. column) that is named Sample_Time_1, which is the time for the first sample. Sample_Time_2 has the time (date) for the second sample and so on.
However - the samples were analysed at the lab and I got the result in a pivottable, which means I have a table where each sample has one row and therefore the results from one patient is displayed on several rows.
For example, create two tables:
x <- c(1,2,2,3,3,3,3,4,5,6,6,6,6,7,8,9,9,10)
y <- as.Date(c("2011-05-17","2012-06-30","2012-08-11","2011-10-15","2011-11-25","2012-01-07","2012-02-15","2011-08-13","2012-02-03","2011-11-08","2011-12-21","2012-02-01","2012-03-12","2012-01-03","2012-04-20","2012-03-31","2012-05-10","2011-12-15"), format="%Y-%m-%d", origin="1960-01-01")
z <- c(123,185,153,153,125,148,168,187,194,115,165,167,143,151,129,130,151,134)
Sheet_1 <- matrix(c(x,y,z), ncol=3, byrow=FALSE)
colnames(Sheet_1) <- c("ID","Sample_Time", "Sample_Value")
Sheet_1 <- as.data.frame(Sheet_1)
Sheet_1$Sample_Time <- y
x1 <- c(1,2,3,4,5,6,7,8,9,10)
x2 <- c(3,3,2,3,2,2,4,2,3,3)
x3 <- c(1,2,2,3,3,1,3,1,1,2)
x4 <- as.Date(c("2011-05-17","2012-06-30","2011-10-15","2011-08-13","2012-02-03","2011-11-08","2012-01-03","2012-04-20","2012-03-31","2011-12-15"), format="%Y-%m-%d", origin="1960-01-01")
x5 <- as.Date(c(NA,"2012-08-11","2011-11-25",NA,NA,"2011-12-21",NA,NA,"2012-05-10",NA), format="%Y-%m-%d", origin="1960-01-01")
x6 <- as.Date(c(NA,NA,"2012-01-07",NA,NA,"2012-02-01",NA,NA,NA,NA), format="%Y-%m-%d", origin="1960-01-01")
x7 <- as.Date(c(NA,NA,"2012-02-15",NA,NA,"2012-03-12",NA,NA,NA,NA), format="%Y-%m-%d", origin="1960-01-01")
Sheet_2 <- as.data.frame(c(1:10))
colnames(Sheet_2) <- "ID"
Sheet_2$Stage <- x2
Sheet_2$Treatment <- x3
Sheet_2$Sample_Time_1 <- x4
Sheet_2$Sample_Time_2 <- x5
Sheet_2$Sample_Time_3 <- x6
Sheet_2$Sample_Time_4 <- x7
Sheet_2$Sample_Value_1 <- NA
Sheet_2$Sample_Value_2 <- NA
Sheet_2$Sample_Value_3 <- NA
Sheet_2$Sample_Value_4 <- NA
I would like to transfer the Sample_Value for the first date a sample was taken from a patient from Sheet_1 to Sheet_2$Sample_Value_1 and if there are more samples, I would like to transfer them to column "Sample_Value_2" and so on.
I have tried with a double for-loop. For each patient (=ID) in Sheet_1 I have run through Sheet_2 and if there is a mach on ID, then I use another for-loop to see if there is a mach on a Sample_Time and insert (using if) the Sample_Value. However, I do not manage to get it to work and I have a strong feeling there must be a better way.
Any suggestions?
Is this what you want:
Prepare Sheet_1 for reshaping from long to wide by introducing an extra column with unique ID for each blood sample per patient
Sheet_1$uniqid <- with(Sheet_1, ave(as.character(ID), ID, FUN = seq_along))
And with this, do the re-shaping
S_1 <- reshape( Sheet_1, idvar = "ID", timevar = "uniqid", direction = "wide")
which gives you
> S_1
ID Sample_Time.1 Sample_Value.1 Sample_Time.2 Sample_Value.2 Sample_Time.3
1 1 2011-05-17 123 <NA> NA <NA>
2 2 2012-06-30 185 2012-08-11 153 <NA>
4 3 2011-10-15 153 2011-11-25 125 2012-01-07
8 4 2011-08-13 187 <NA> NA <NA>
9 5 2012-02-03 194 <NA> NA <NA>
10 6 2011-11-08 115 2011-12-21 165 2012-02-01
14 7 2012-01-03 151 <NA> NA <NA>
15 8 2012-04-20 129 <NA> NA <NA>
16 9 2012-03-31 130 2012-05-10 151 <NA>
18 10 2011-12-15 134 <NA> NA <NA>
Sample_Value.3 Sample_Time.4 Sample_Value.4
1 NA <NA> NA
2 NA <NA> NA
4 148 2012-02-15 168
8 NA <NA> NA
9 NA <NA> NA
10 167 2012-03-12 143
14 NA <NA> NA
15 NA <NA> NA
16 NA <NA> NA
18 NA <NA> NA
The number after the dot in the colnames is the uniqid.
Now you can merge the relevant columns from Sheet_2
S_2 <- merge( Sheet_2[ 1:3 ], S_1, by = "ID" )
and the result should be what you are looking for:
> S_2
ID Stage Treatment Sample_Time.1 Sample_Value.1 Sample_Time.2 Sample_Value.2
1 1 3 1 2011-05-17 123 <NA> NA
2 2 3 2 2012-06-30 185 2012-08-11 153
3 3 2 2 2011-10-15 153 2011-11-25 125
4 4 3 3 2011-08-13 187 <NA> NA
5 5 2 3 2012-02-03 194 <NA> NA
6 6 2 1 2011-11-08 115 2011-12-21 165
7 7 4 3 2012-01-03 151 <NA> NA
8 8 2 1 2012-04-20 129 <NA> NA
9 9 3 1 2012-03-31 130 2012-05-10 151
10 10 3 2 2011-12-15 134 <NA> NA
Sample_Time.3 Sample_Value.3 Sample_Time.4 Sample_Value.4
1 <NA> NA <NA> NA
2 <NA> NA <NA> NA
3 2012-01-07 148 2012-02-15 168
4 <NA> NA <NA> NA
5 <NA> NA <NA> NA
6 2012-02-01 167 2012-03-12 143
7 <NA> NA <NA> NA
8 <NA> NA <NA> NA
9 <NA> NA <NA> NA
10 <NA> NA <NA> NA
Using the information on hand I need to predict how much of a particular product we need next month. I have several months worth of data going back, however the data is separated by both VPN and by a separate warehouse number. I just need to know how much to order in general and ignore the warehouse separation. we'll be adding that back in later.
There are multiple duplicates of many of the VPN's and i would like to consolidate all the duplicates and also sum the numbers that have been separated.
VPN Month To Date December November October September August July June May April March
0A36227-AA 15 6 4 2 NA 4 6 4 2 <NA> 4
0A36227-AA NA 1 NA NA NA NA 1 <NA> <NA> <NA> <NA>
0A36227-AA 2 3 1 NA 2 3 3 1 <NA> 2 3
0A36258-AA NA NA NA 1 NA NA <NA> <NA> 1 <NA> <NA>
0A36258-AA 1 NA 1 NA NA NA <NA> 1 <NA> <NA> <NA>
0A36258-AA NA NA NA 1 NA NA <NA> <NA> 1 <NA> <NA>
0A36258-AA 1 NA NA NA NA NA <NA> <NA> <NA> <NA> <NA>
So i want to combine all the duplicates and add all the numbers from the rows into just one row per VPN.
I've tried using the aggregate function and it didn't work for me. i may have used it wrong though.
any help would be appreciated!
also there are some cases where it may cause an infinite number to show up. if anyone has any further advice for how to handle that it would be welcome.
You basically want to know how to perform sum while grouping in your data frame.
You will find plenty of answer.
I have a data.table solution for your case:
plouf <- read.table(text = " VPN Month.To.Date December November October September August July June May April March
0A36227-AA 15 6 4 2 NA 4 6 4 2 <NA> 4
0A36227-AA NA 1 NA NA NA NA 1 <NA> <NA> <NA> <NA>
0A36227-AA 2 3 1 NA 2 3 3 1 <NA> 2 3
0A36258-AA NA NA NA 1 NA NA <NA> <NA> 1 <NA> <NA>
0A36258-AA 1 NA 1 NA NA NA <NA> 1 <NA> <NA> <NA>
0A36258-AA NA NA NA 1 NA NA <NA> <NA> 1 <NA> <NA>
0A36258-AA 1 NA NA NA NA NA <NA> <NA> <NA> <NA> <NA>",
stringsAsFactors = FALSE, header = TRUE)
here is the code
DT <- setDT(plouf)
tochange <- names(DT)[!names(DT) %in% "VPN"]
here the tochange vector is the list of your column you want to average
DT[,c(tochange) := lapply(.SD,function(x){as.numeric(x)}),.SDcols = tochange]
DT[,lapply(.SD,function(x){sum(x,na.rm = TRUE)}),.SDcols = tochange,by = VPN]
The first line is to set everything to numeric¨
The second line perform the sum ignoring the NAs and grouping by VPN. I am not 100% sure that is what you wanted.
VPN Month.To.Date December November October September August July June May April March i
1: 0A36227-AA 17 10 5 2 2 7 10 5 2 2 7 10
2: 0A36258-AA 2 0 1 2 0 0 0 1 2 0 0 0
I hope it helps
here is the dplyr equivalent
plouf %>%
mutate_at(vars(tochange),funs(as.numeric)) %>%
group_by(VPN) %>%
summarise_at(vars(tochange),funs(sum(.,na.rm = TRUE)))
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Replacing NAs with latest non-NA value
How can I fill up missing information using the previous values for each column?
Date.end Date.beg Pollster Serra.PSDB
2012-06-26 2012-06-25 Datafolha 31.0
2012-06-27 <NA> <NA> NA
2012-06-28 <NA> <NA> NA
2012-06-29 <NA> <NA> NA
2012-06-30 <NA> <NA> NA
2012-07-01 <NA> <NA> NA
2012-07-02 <NA> <NA> NA
2012-07-03 <NA> <NA> NA
2012-07-04 <NA> Ibope 22
2012-07-05 <NA> <NA> NA
2012-07-06 <NA> <NA> NA
2012-07-07 <NA> <NA> NA
2012-07-08 <NA> <NA> NA
2012-07-09 <NA> <NA> NA
2012-07-10 <NA> <NA> NA
2012-07-11 <NA> <NA> NA
2012-07-12 2012-07-09 Veritá 31.4
I'm not sure if that is the best way to do it. Probably there is some package with exactly that functionality out there. The following approach might not be the one with the very best performance, but it certainly works and should be fine for small to medium datasets. I would be cautious to apply it for very large datasets (more than a million rows or something like that)
fillNAByPreviousData <- function(column) {
# At first we find out which columns contain NAs
navals <- which(is.na(column))
# and which columns are filled with data.
filledvals <- which(! is.na(column))
# If there would be no NAs following each other, navals-1 would give the
# entries we need. In our case, however, we have to find the last column filled for
# each value of NA. We may do this using the following sapply trick:
fillup <- sapply(navals, function(x) max(filledvals[filledvals < x]))
# And finally replace the NAs with our data.
column[navals] <- column[fillup]
column
}
Here is some example using a test dataset:
set.seed(123)
test <- 1:20
test[floor(runif(5,1, 20))] <- NA
> test
[1] 1 2 3 4 5 NA 7 NA 9 10 11 12 13 14 NA 16 NA NA 19 20
> fillNAByPreviousData(test)
[1] 1 2 3 4 5 5 7 7 9 10 11 12 13 14 14 16 16 16 19 20