Seeing the huge possibilities by R to avoid endless Excel-Copy-and-paste-sessions I started exploring it, but now I face trouble formatting several (15+) .csv tables. They all have the same appearance:
Date;Value;Name
1.1.2011;30;GWM5
1.2.2011;31;GWM5
1.3.2011;35;GWM5
1.4.2011;31;GWM5
1.1.2011;23;GWM5
1.2.2011;24;GWM5
1.3.2011;21;GWM5
1.4.2011;22;GWM5
As you can see, the name is constant, the date is reoccurring, only the value is variable. I need to have it in a table like this:
Name;date;value1;value2
GWM5;1.1.2011;30;23
GWM5;1.2.2011;31;24
GWM5;1.3.2011;35;21
GWM5;1.4.2011;31;22
I already tried to order, transpose and determine duplicates. But transpose (even as a function: http://www.endmemo.com/program/R/transpose.php ) didn't gave me the right arrangement in individual cells (there were always 2 or more values in 1 cell) and determine duplicates only gave me single values, not rows.
Can you please help me? I would like to avoid putting them into the right order in Excel by hand.
The base R approach would be something like this:
mydf <- read.csv("your.file.txt", sep=";")
mydf$time <- with(mydf, ave(rep(1, nrow(mydf)), Date, FUN = seq_along))
reshape(mydf, direction = "wide", idvar=c("Name", "Date"), timevar="time")
# Date Name Value.1 Value.2
# 1 1.1.2011 GWM5 30 23
# 2 1.2.2011 GWM5 31 24
# 3 1.3.2011 GWM5 35 21
# 4 1.4.2011 GWM5 31 22
This seems about right, although there might be an easier way:
d <- read.csv2(text="
Date;Value;Name
1.1.2011;30;GWM5
1.2.2011;31;GWM5
1.3.2011;35;GWM5
1.4.2011;31;GWM5
1.1.2011;23;GWM5
1.2.2011;24;GWM5
1.3.2011;21;GWM5
1.4.2011;22;GWM5")
library(reshape2)
library(plyr)
d2 <- ddply(d,"Date",transform,n=paste0("Value",seq_along(Value)))
dcast(d2,Date+Name~n,value.var="Value")
## Date Name Value1 Value2
## 1 1.1.2011 GWM5 30 23
## 2 1.2.2011 GWM5 31 24
## 3 1.3.2011 GWM5 35 21
## 4 1.4.2011 GWM5 31 22
Related
Looking for advice on refining my code and also trimming to a date range.
The spreadsheet itself is pulled from another system and so the structure of the excel cannot be changed. When you pull the data it basically starts at E2, with the first date column in F2, and the first item in E3. The data will continue to populate to the right for as long as it goes on for. I have replicated the structure below.
AndI want it to look like:
I have come up with the below, which works, but I was looking for advice on refining it down to fewer individual step by steps.
In the below code:
= extracting data
= pulling the dates out
= formatting from
excel number to an actual date
= grabbing the item names
= transposing data and skipping some parts
= adding in dates to the row names
#1
df <- data.frame(read_excel("C:/example.xlsx",
sheet = "Sheet1"))
#2
dfdate <- gtb[1, -c(1,2,3,4,5)]
#3
dfdate <- format(as.Date(as.numeric(dfdate),
origin = "1899-12-30"), "%d/%m/%Y")
#4
rownames(gtb) <- gtb[,1]
#5
gtb <- as.data.frame(t(gtb[, -c(1,2,3,4,5)]))
#6
rownames(gtb) <- dfdate
After the row names have been added the structure is such that I am happy to start creating the visuals where needed.
thanks for your advice
David
Here is one suggestion, I don't really have easy access to your data, but I am including code to remove those columns as you do, based on their names, which can be nicer than removing by index.
df <- read.table( text=
"Item_Code 01/01/2018 01/02/2018 01/03/2018 01/04/2018
Item 99 51 60 69
Item2 42 47 88 2
Item3 36 81 42 48
",header=TRUE, check.names=FALSE) %>%
rename( `Item Code` = Item_Code )
library(tibble)
library(lubridate)
x <- df %>% select( -matches("Code \\d|Internal Code") ) %>%
column_to_rownames("Item Code") %>%
t %>% as.data.frame %>%
rownames_to_column("Item Code") %>%
mutate( `Item Code` = dmy(`Item Code`) )
x
Output:
> x
Item Code Item Item2 Item3
1 2018-01-01 99 42 36
2 2018-02-01 51 47 81
3 2018-03-01 60 88 42
4 2018-04-01 69 2 48
I went a bit forth and back with this solution, but it can be nice to also showcase how to remove columns by a regex on their column names, since you are removing several similarly named columns.
The t trick, that you also use, works becuase there is really only one more column there that would cause problems with this, as others have commented, and this can be temporarily stowed away as rownames. If that weren't the case, you're looking at a more complex solution involving pivot_wider and pivot_longer or splitting the data.frame and transposing only one of the halves.
Can anyone help arranging long data to wide data, but complicated by linked results, ie listed in wide format identified by Study Number this repetitive results listed in wide format after the SN (I've shown an abbreviated table there are more results per patient listed along the bottom with repetitive in columns LabTest, LabDate, Result, Lower, Upper)...I've tried melt and recast, and binding columns but can't seem to get it to work. Over 1000 results to reformat so can't input results manually need to reformat a long data excel document in wide format in R Thank you
Original data looks like this
SN LabTest LabDate Result Lower Upper
TD62 Creat 05/12/2004 22 30 90
TD62 AST 06/12/2004 652 6 45
TD58 Creat 26/05/2007 72 30 90
TD58 Albumin 26/05/2005 22 25 35
TD14 AST 28/02/2007 234 6 45
TD14 Albumin 26/02/2007 15 25 35
Formatted data should look like this
SN LabTCode LabDate Result Lower Upper LabCode LabDate Result Lower Upper
TD62 Creat 05/12/04 22 30 90 AST 06/12/04 652 6 45
TD58 Creat 26/05/05 72 30 90 Alb 26/05/05 22 25 35
TD14 AST 28/02/07 92 30 90 Alb 26/02/07 15 25 35
Formatted data looks like this
So far I have tried:
data_wide2 <- dcast(tdl, SN + LabDate ~ LabCode, value.var="Result")
and
melt(tdl, id = c("SN", "LabDate"), measured= c("Result", "Upper", + "Lower"))
Your issue is that R won't like the final table because it has duplicate column names. Maybe you need the data in that format but it's a bad way to store data because it would be difficult to put the columns back into rows again without a load of manual work.
That said, if you want to do it you'll need a new column to help you transpose the data.
I've used dplyr and tidyr below, which are worth looking at rather than reshape. They're by the same author but more modern and designed to fit together as part of the 'tidyverse'.
library(dplyr)
library(tidyr)
#Recreate your data (not doing this bit in your question is what got you downvoted)
df <- data.frame(
SN = c("TD62","TD62","TD58","TD58","TD14","TD14"),
LabTest = c("Creat","AST","Creat","Albumin","AST","Albumin"),
LabDate = c("05/12/2004","06/12/2004","26/05/2007","26/05/2005","28/02/2007","26/02/2007"),
Result = c(22,652,72,22,234,15),
Lower = c(30,6,30,25,6,25),
Upper = c(90,45,90,35,45,35),
stringsAsFactors = FALSE
)
output <- df %>%
group_by(SN) %>%
mutate(id_number = row_number()) %>% #create an id number to help with tracking the data as it's transposed
gather("key", "value", -SN, -id_number) %>% #flatten the data so that we can rename all the column headers
mutate(key = paste0("t",id_number, key)) %>% #add id_number to the column names. 't' for 'test' to start name with a letter.
select(-id_number) %>% #don't need id_number anymore
spread(key, value)
SN t1LabDate t1LabTest t1Lower t1Result t1Upper t2LabDate t2LabTest t2Lower t2Result t2Upper
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 TD14 28/02/2007 AST 6 234 45 26/02/2007 Albumin 25 15 35
2 TD58 26/05/2007 Creat 30 72 90 26/05/2005 Albumin 25 22 35
3 TD62 05/12/2004 Creat 30 22 90 06/12/2004 AST 6 652 45
And you're there, possibly with some sorting issues still to crack if you need the columns in a specific order.
I am attempting to repeatedly add a "fixed number" to a numeric vector depending on a specified bin size. However, the "fixed number" is dependent on the data range.
For instance ; i have a data range 10 to 1010, and I wish to separate the data into 100 bins. Therefore ideally the data would look like this
Since 1010 - 10 = 1000
And 1000 / 100(The number of bin specified) = 10
Therefore the ideal data would look like this
bin1 - 10 (initial data)
bin2 - 20 (initial data + 10)
bin3 - 30 (initial data + 20)
bin4 - 40 (initial data + 30)
bin100 - 1010 (initial data + 1000)
Now the real data is slightly more complex, there is not just one data range but multiple data range, hopefully the example below would clarify
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
Ideally I wish to get something like
10 20
20 30
30 40
.. ..
5000 5015
5015 5030
5030 5045
.. ..
4857694 4858096 # Note theoretically it would have decimal places,
#but i do not want any decimal place
4858096 4858498
.. ..
So far I was thinking along this kind of function, but it seems inefficient because ;
1) I have to retype the function 100 times (because my number of bin is 100)
2) I can't find a way to repeat the function along my values - In other words my function can only deal with the data 10-1010 and not the next one 5000-6500
# The range of the variable
width <- end - start
# The bin size (Number of required bin)
bin_size <- 100
bin_count <- width/bin_size
# Create a function
f1 <- function(x,y){
c(x[1],
x[1] + y[1],
x[1] + y[1]*2,
x[1] + y[1]*3)
}
f1(x= start,y=bin_count)
f1
[1] 10 20 30 40
Perhaps any hint or ideas would be greatly appreciated. Thanks in advance!
Aafter a few hours trying, managed to answer my own question, so I thought to share it. I used the package "binr" and the function in the package called "bins" to get the required bin. Please find below my attempt to answer my question, its slightly different than the intended output but for my purpose it still is okay
library(binr)
# Some fixed values
start <- c(10, 5000, 4857694)
end <- c(1010, 6500, 4897909)
tmp_list_start <- list() # Create an empty list
# This just extract the output from "bins" function into a list
for (i in seq_along(start)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
# Now i need to convert one of the output from bins into numeric value
s <- gsub(",.*", "", names(tmp$binct))
s <- gsub("\\[","",s)
tmp_list_start[[i]] <- as.numeric(s)
}
# Repeating the same thing with slight modification to get the end value of the bin
tmp_list_end <- list()
for (i in seq_along(end)){
tmp <- bins(start[i]:end[i],target.bins = 100,max.breaks = 100)
e <- gsub(".*,", "", names(tmp$binct))
e <- gsub("]","",e)
tmp_list_end[[i]] <- as.numeric(e)
}
v1 <- unlist(tmp_list_start)
v2 <- unlist(tmp_list_end)
df <- data.frame(start=v1, end=v2)
head(df)
start end
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
6 61 70
Pardon my crappy code, Please share if there is a better way of doing this. Would be nice if someone could comment on how to wrap this into a function..
Here's a way that may help with base R:
bin_it <- function(START, END, BINS) {
range <- END-START
jump <- range/BINS
v1 <- c(START, seq(START+jump+1, END, jump))
v2 <- seq(START+jump-1, END, jump)+1
data.frame(v1, v2)
}
It uses the function seq to create the vectors of numbers leading to the ending number. It may not work for every case, but for the ranges you gave it should give the desired output.
bin_it(10, 1010)
v1 v2
1 10 20
2 21 30
3 31 40
4 41 50
5 51 60
bin_it(5000, 6500)
v1 v2
1 5000 5015
2 5016 5030
3 5031 5045
4 5046 5060
5 5061 5075
bin_it(4857694, 4897909)
v1 v2
1 4857694 4858096
2 4858097 4858498
3 4858499 4858900
4 4858901 4859303
5 4859304 4859705
6 4859706 4860107
I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:
names(data.jan)
ID AGE CITY GENDER
names(data.feb)
ID AGE CITY GENDER
To simplify, I will consider only two months, january and february, but my real set of csv files go from january to november:
Considering a "customer X",I have three possible scenarios:
1- Customer X is listed in the january database, but he left and now is not listed in february
2- Customer X is listed in both january and february databases
3- Customer X entered the database in february, so he is not listed in january
I am stuck on the following problem: I need to create a single database with all customers and their respective information that are listed in both dataframes. However, considering a customer that is listed in both dataframes, I want to pick his information from his first entry, that is, january.
When I use merge, I have four options, acording to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html
data <- merge(data.jan,data.feb, by="ID", all=TRUE)
Regardless of which all, all.x or all.y I choose, I get the same undesired output called data:
data[1,]
ID AGE.x CITY.x GENDER.x AGE.y CITY.y GENDER.y
123 25 NY M 25 NY M
I think that what would work here is to merge both databases with this type of join:
Then, merge the resulting dataframe with data.jan with the full outer join. But I don't know how to code this in R.
Thanks,
Bernardo
d1 <- data.frame(x=1:9,y=1:9,z=1:9)
d2 <- data.frame(x=1:10,y=11:20,z=21:30) # example data
d3 <- merge(d1,d2, by="x", all=TRUE) #merge
# keep the original columns from janary (i.e. y.x, z.x)
# but replace the NAs in those columns with the data from february (i.e. y.y,z.y )
d3[is.na(d3[,2]) ,][,2:3] <- d3[is.na(d3[,2]) ,][, 4:5]
#> d3[, 1:3]
# x y.x z.x
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 20 30
This may be tiresome for more than 2 months though, perhaps you should consider #flodel's comments, also note there are demons when your original Jan data has NAs (and you still want the first months data, NA or not, retained) although you never mentioned them in your question.
Try:
data <- merge(data.jan,data.frame(ID=data.feb$ID), by="ID")
although I haven't tested it since no data, but if you just join the ID col from Feb, it should only filter out anything that isn't in both frames
#user1317221_G's solution is excellent. If your tables are large (lots of customers), data tables might be faster:
library(data.table)
# some sample data
jan <- data.table(id=1:10, age=round(runif(10,25,55)), city=c("NY","LA","BOS","CHI","DC"), gender=rep(c("M","F"),each=5))
new <- data.table(id=11:16, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
feb <- rbind(jan[6:10,],new)
new <- data.table(id=17:22, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
mar <- rbind(jan[1:5,],new)
setkey(jan,id)
setkey(feb,id)
join <- data.table(merge(jan, feb, by="id", all=T))
join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
Edit: This adds processing for multiple months.
f <- function(x,y) {
setkey(x,id)
setkey(y,id)
join <- data.table(merge(x,y,by="id",all=T))
join[is.na(age.x) , names(join)[2:4]:= join[is.na(age.x),5:7,with=F]]
join[,names(join)[5:7]:=NULL] # get rid of extra columns
setnames(join,2:4,c("age","city","gender")) # rename columns that remain
return(join)
}
Reduce("f",list(jan,feb,mar))
Reduce(...) applies the function f(...) to the elements of the list in turn, so first to jan and feb, and then to the result and mar, etc.
DF2
Date EMMI ACT NO2
2011/02/12 12345 21 11
2011/02/14 43211 22 12
2011/02/19 12345 21 13
2011/02/23 43211 13 12
2011/02/23 56341 13 12
2011/03/03 56431 18 20
I need to find difference between two dates in a column. For example difference between ACT column values.For example, the EMMI 12345, Difference between dates 2011/02/19 - 2011/02/12 = 21-21 = 0. like that i want to do for entire column of ACT. Add a new column diff and add values to that. Can anybody let me know please how to do it.
This is the output i want
DF3
Date EMMI ACT NO2 DifACT
2011/02/12 12345 21 11 NA
2011/02/14 43211 22 12 NA
2011/02/19 12345 21 13 0
2011/02/23 43211 13 12 -9
2011/02/23 56341 13 12 5
Try this:
DF3 <- DF2
DF3$difACT <- ave( DF3$ACT, DF3$EMMI, FUN= function(x) c(NA, diff(x)) )
As long as the dates are sorted (within EMMI) this will work, if they are not sorted then we would need to modify the above to sort within EMMI first. I would probably sort the entire data frame on date first (and save the results of order), then run the above. Then if you need it back in the original order you can run order on the results of the original order results to "unorder" the data frame.
This is based on plyr package (not tested):
library(plyr)
DF3<-ddply(DF2,.(EMMI),mutate,difACT=diff(ACT))