Assign column name based on key in preceding column - r

I am importing a data set from a CSV file. The top two lines are key-value pairs:
UTC BSP AWA TWA
1.  2.  3.  4.
The data below this has a column containing the value, followed by a column with the corresponding data, repeated for each key. For example:
1 41223 2 0 3 045 4 026
1 41224 2 0 3 052 4 035
1 41225 2 0 3 087 4 040
1 41226 2 0 3 023 4 041
2 0 3 052 4 082
I want it to become:
UTC   BSP AWA TWA
41223 0   045 026
I tried putting the key-value pairs in a list:
test <- read.csv("my.csv", nrows = 2, header = FALSE, stringsAsFactors = FALSE)
# unlist() turns the one-row data frames into plain vectors before converting
key <- as.character(unlist(test[1, ]))
value <- as.numeric(unlist(test[2, ]))
mylist <- list()
for (i in seq_along(key)) {
  mylist[[key[i]]] <- value[i]
}
Then I tried to match or reference the preceding column against the values in the list, but I have not managed to do so.
Any help much appreciated
Thanks
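One possible approach, sketched in base R. It assumes the file is comma-separated, that the codes in the second line are plain 1 to 4, and that every data line is a sequence of (code, data) pairs. The `lines` vector below is a hypothetical in-memory stand-in for `readLines("my.csv")`, and the helper names are made up for illustration.

```r
# Hypothetical stand-in for readLines("my.csv"); separator and layout are assumptions
lines <- c(
  "UTC,BSP,AWA,TWA",
  "1,2,3,4",
  "1,41223,2,0,3,045,4,026",
  "1,41224,2,0,3,052,4,035",
  "2,0,3,052,4,082"
)

# The first two lines give the lookup: code -> column name
key_names <- strsplit(lines[1], ",")[[1]]
key_codes <- strsplit(lines[2], ",")[[1]]
lookup <- setNames(key_names, key_codes)

# Each data line alternates code, data, code, data, ...
fields <- strsplit(lines[-(1:2)], ",")
rows <- lapply(fields, function(f) {
  codes  <- f[c(TRUE, FALSE)]   # odd positions
  values <- f[c(FALSE, TRUE)]   # even positions
  out <- setNames(rep(NA_character_, length(lookup)), lookup)
  out[lookup[codes]] <- values  # place each value under the name its code maps to
  out
})

# Values stay as character so leading zeros such as "045" are kept
result <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
print(result)
```

A line that lacks one of the codes (like the last one here) simply gets NA in that column. Convert individual columns with as.numeric() afterwards if the leading zeros are not needed.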

Related

How to add columns from another data frame where there are multiple matching rows

I'm new to R and I'm stuck.
NB! I'm sorry, I could not figure out how to add more than one space between numbers and headers in my example, so I used "_" instead.
The problem:
I have two data frames (Graduations and Occupations). I want to match the occupations to the graduations. The difficult part is that one person might be present multiple times in both data frames and I want to keep all the data.
Example:
Graduations
One person may have finished many curriculums. Original DF has more columns but they are not relevant for the example.
Person_ID__curriculum_ID__School ID
___1___________100__________10
___2___________100__________10
___2___________200__________10
___3___________300__________12
___4___________100__________10
___4___________200__________12
Occupations
Not all graduates have jobs. Everyone in the DF should have only one main job (JOB_Type code "1") and can have 0-5 extra jobs (JOB_Type code "0"). The original DF has more columns but they are not relevant currently.
Person_ID___JOB_ID_____JOB_Type
___1_________1223________1
___3_________3334________1
___3_________2120________0
___3_________7843________0
___4_________4522________0
___4_________1240________1
End result:
New DF named "Result" containing the information of all graduations from the first DF(Graduations) and added columns from the second DF (Occupations).
Note that person "2" is not in the Occupations DF. Their data remains but added columns remain empty.
Note that person "3" has multiple jobs and thus extra duplicate rows are added.
Note that person "4" has both multiple jobs and multiple graduations, so extra rows were added to fit in all the data.
New DF: "Result"
Person_ID__Curriculum_ID__School_ID___JOB_ID____JOB_Type
___1___________100__________10_________1223________1
___2___________100__________10
___2___________200__________10
___3___________300__________12_________3334________1
___3___________300__________12_________2122________0
___3___________300__________12_________7843________0
___4___________100__________10_________4522________0
___4___________100__________10_________1240________1
___4___________200__________12_________4522________0
___4___________200__________12_________1240________1
For me the most difficult part is how to make R add extra duplicate rows. I looked around for an example or tutorial about something similar but could not find one. Probably I did not use the right keywords.
I will be very grateful if you could give me examples of how to code it.
You can use merge like:
merge(Graduations, Occupations, all.x=TRUE)
# Person_ID curriculum_ID School_ID JOB_ID JOB_Type
#1 1 100 10 1223 1
#2 2 100 10 NA NA
#3 2 200 10 NA NA
#4 3 300 12 3334 1
#5 3 300 12 2122 0
#6 3 300 12 7843 0
#7 4 100 10 4522 0
#8 4 100 10 1240 1
#9 4 200 12 4522 0
#10 4 200 12 1240 1
Data:
Graduations <- read.table(header=TRUE, text="Person_ID curriculum_ID School_ID
1 100 10
2 100 10
2 200 10
3 300 12
4 100 10
4 200 12")
Occupations <- read.table(header=TRUE, text="Person_ID JOB_ID JOB_Type
1 1223 1
3 3334 1
3 2122 0
3 7843 0
4 4522 0
4 1240 1")
An option with left_join:
library(dplyr)
# Naming the join column explicitly avoids the "Joining with by = ..." message
left_join(Graduations, Occupations, by = "Person_ID")

R Quickly merge list of dataframes with datatable

I have a data.table called csv_t with around 4.5m rows. I have a list of data.frames called binded2 which has 1,874 list elements. I'm trying to merge them by matching the column index in binded2 with the column tpl in csv_t.
Because these datasets are so large, I am having trouble merging csv_t with each element of binded2 quickly.
This is what I'm trying right now:
func <- function(x,y){merge(x, y, by.x=names(x)[4], by.y=names(y)[1])}
binded3 <- lapply(binded2, func,csv_t)
What binded2 and csv_t look like:
> binded2 # only first 3 elements shown
[[1]]
start end tpl index
1 1 1 2222733 2222733
2 6 6 2222733 2222738
[[2]]
start end tpl index
1 1 1 2222736 2222736
2 6 6 2222736 2222741
[[3]]
start end tpl index
1 4 4 2222750 2222753
> head(csv_t)
tpl strand base score ipdRatio
1: 3239 0 G 6 2.684
2: 3240 0 C 6 1.764
3: 3241 0 T 7 1.861
4: 3243 0 C 13 1.410
5: 3244 0 A 0 0.238
6: 3245 0 C 6 1.261
I would like to merge these two by matching up column tpl in csv_t and column index in binded2. I appreciate your help
dropbox link to csv_t.csv
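One thing that may help is to stack the list into a single table and merge once, instead of calling merge 1,874 times. Below is a minimal base R sketch of that idea on made-up toy data: the csv_t values here are hypothetical (chosen so that some rows match) and the `element` column is an illustrative name. With data.table, rbindlist(binded2, idcol = "element") does the stacking step.

```r
# Toy stand-ins; the real binded2 has 1,874 elements and csv_t about 4.5m rows
binded2 <- list(
  data.frame(start = c(1, 6), end = c(1, 6), tpl = 2222733,
             index = c(2222733, 2222738)),
  data.frame(start = 4, end = 4, tpl = 2222750, index = 2222753)
)
csv_t <- data.frame(
  tpl      = c(2222733, 2222738, 2222753, 2222999),  # hypothetical values
  strand   = 0,
  base     = c("G", "C", "T", "A"),
  score    = c(6, 6, 7, 13),
  ipdRatio = c(2.684, 1.764, 1.861, 1.410)
)

# Stack the list into one data frame, remembering which element each row came from
stacked <- do.call(rbind, Map(function(d, i) cbind(element = i, d),
                              binded2, seq_along(binded2)))

# A single merge: index (from binded2) against tpl (from csv_t)
merged <- merge(stacked, csv_t, by.x = "index", by.y = "tpl")

# Split back into one data frame per original list element, if a list is needed
binded3 <- split(merged, merged$element)
print(binded3)
```

List elements with no matching rows at all drop out of the split() result, so check for that if every element must be present in the output.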

How to lookup value higher in data.frame in R

I have a time series data.frame in long format, with all values stacked below each other. On every date there are several cases, and the same cases come back regularly. Based on the time series I am adding a column with some calculations, which are done per case. For these calculations I need the value from the previous date of that case. I have no idea which function to use. Can anybody point me to a function or an example somewhere on the net? Thanks!!
To be clear, this is what I mean. On date 1 the old value (before the score) for case 'a' is 1200. Based on the score of 1 the new value becomes 1250. On date 2 I want this new value 1250 placed in the column 'old value' (and then do some calculations to get the new value; that new value has to be placed in the column old value again on date 4 or so, et cetera). The same goes for case B: the new value after the score on date 1 is 1190 and has to be placed in the correct row on date 3 (on date 2 there is no case B), et cetera for 1000's of cases and dates.
date name_case score old_value new_value
1 a 1 1200 1250
1 b 2 1275 1190
1 c 1 1300 1310
2 a 3 1250
2 c 1 1310
3 B 1 1190
Maybe this will do it. Assuming that we start with:
> dat
date name_case score old_value new_value
1 1 a 1 1200 1250
2 1 b 2 1275 1190
3 1 c 1 1300 1310
4 2 a 3 NA NA
5 2 c 1 NA NA
6 3 b 1 NA NA # note ... fixed cap issue
And then make a subset with values for new_value:
dat1 <- dat[ !is.na(dat$old_value), ]
And then replace the NA old_values with results from new_value in the subset by matching on name_case:
dat[ is.na(dat$old_value), "old_value" ] <-
  dat1$new_value[ match(dat[ is.na(dat$old_value), "name_case" ],
                        dat1$name_case) ]
match generates a numeric vector that is used to index the new_values. Note that the names looked up are those of the rows with a missing old_value.
> dat
date name_case score old_value new_value
1 1 a 1 1200 1250
2 1 b 2 1275 1190
3 1 c 1 1300 1310
4 2 a 3 1250 NA
5 2 c 1 1310 NA
6 3 b 1 1190 NA
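For reference, the match-based lookup can be written as a self-contained sketch. The data frame is rebuilt from the example above; `known`, `missing` and `pos` are illustrative names. It fills only one step (the first known new_value per case), so repeated carry-forward over many dates would need a loop or a grouped approach.

```r
# Example data from the question, with the capitalisation of case "b" fixed
dat <- data.frame(
  date      = c(1, 1, 1, 2, 2, 3),
  name_case = c("a", "b", "c", "a", "c", "b"),
  score     = c(1, 2, 1, 3, 1, 1),
  old_value = c(1200, 1275, 1300, NA, NA, NA),
  new_value = c(1250, 1190, 1310, NA, NA, NA),
  stringsAsFactors = FALSE
)

known   <- dat[!is.na(dat$new_value), ]  # rows whose new_value is already computed
missing <- is.na(dat$old_value)          # rows that still need an old_value

# For each row to fill, find the position of its case among the known rows
pos <- match(dat$name_case[missing], known$name_case)

# Use those positions to pull the corresponding new_value
dat$old_value[missing] <- known$new_value[pos]
print(dat)
```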

Creating a column in R which is based in information in other columns

My data set consists of the columns CODE, FIRE, CARBON, and DEPTH.
Examples of values in the column CODE are chic, chiq, lopc, lopq etc.
Besides this, I have information stating that each value in the column CODE (chic, for example) belongs to either A, B or C.
So I want to create a new column in R which is based on the CODE column, so that when CODE is, for example, chic, the value in the new column is A on the same row.
I hope someone can help me with this.
Thanks in advance!
(I wanted to insert a table here with an example of the data, but it didn't work. I hope it is still clear what I want to create.)
You could also use (assuming your data is in a data frame df)...
df$NEW[df$CODE == "chic"] <- "A"
df$NEW[df$CODE == "chiq"] <- "B"
df
# CODE DEPTH CARBON FIRE NEW
# 1 chic 0 24.2 U A
# 2 chic 10 25.6 U A
# 3 chiq 0 20.3 B B
# 4 chiq 0 50.3 B B
# 5 chiq 10 40.5 B B
# 6 chiq 20 24.2 B B
# 7 polc 0 60.5 U <NA>
# 8 polc 10 10.2 U <NA>
# 9 polq 0 20.5 U <NA>
etc...
To recode multiple values at once, use %in% and a vector for example:
df$NEW[df$CODE %in% c("chic", "chiq")] <- "A"
You can try:
yourdata$newcolumn<-sapply(as.character(yourdata$CODE),
switch,
"chic"="A","code2"="B","code3"="C")
Of course, you'll need to adapt the code to your data, but you didn't give a lot of details...
Example:
With yourdata being the data.frame below:
CODE FIRE CARBON DEPTH
1 chic 1 3 -1.3107045
2 code2 1 3 1.0781249
3 code3 1 3 -0.6762211
4 chic 1 3 0.9196376
5 code2 2 4 -0.7161380
6 code2 2 4 -0.5506980
7 chic 2 4 0.3323771
8 chic 2 4 -0.4131832
you get the new data.frame yourdata:
CODE FIRE CARBON DEPTH newcolumn
1 chic 1 3 -1.3107045 A
2 code2 1 3 1.0781249 B
3 code3 1 3 -0.6762211 C
4 chic 1 3 0.9196376 A
5 code2 2 4 -0.7161380 B
6 code2 2 4 -0.5506980 B
7 chic 2 4 0.3323771 A
8 chic 2 4 -0.4131832 A
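When there are many codes, a named vector used as a lookup table is another option. This is a minimal sketch: the mapping in `groups` is hypothetical apart from chic and chiq (the question does not say which group each code belongs to), and the small data frame is made up for illustration.

```r
# Toy data; only the CODE column matters for the lookup
df <- data.frame(
  CODE  = c("chic", "chic", "chiq", "lopc", "lopq"),
  DEPTH = c(0, 10, 0, 0, 10),
  stringsAsFactors = FALSE
)

# Hypothetical mapping: names are the codes, values are the groups
groups <- c(chic = "A", chiq = "B", lopc = "C")

# Index the lookup vector by code; as.character() guards against CODE being a factor
df$NEW <- unname(groups[as.character(df$CODE)])
print(df)
```

Codes that are missing from the lookup (lopq here) get NA, which makes gaps in the mapping easy to spot.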

Merging columns from different data frames

I have a problem....
I have two data frames
>anna1
name from to result
11 66607 66841 0
11 66846 67048 0
11 67053 67404 0
11 67409 68216 0
11 68221 68786 0
11 68791 69020 0
11 69025 69289 0
11 69294 70167 0
11 70172 70560 0
and the second data frame is
>anna2
name from to result
11 66607 66841 5
11 66846 67048 6
11 67409 68216 7
11 69025 69289 12
11 70172 70560 45
What I want is to create a new data frame similar to anna1, where the 0 values are replaced by the correct results from anna2, in the correct rows.
You will notice that the from and to columns of anna2 contain only some of the values that appear in anna1; the intermediate rows are missing.
So I need to somehow take the numbers from the result column in anna2 and put them in the correct rows of anna1.
thank you in advance
Best regards
Anna
A simpler merge:
anna3 <-merge(anna2,anna1[,1:3], all.y=TRUE)
anna3[is.na(anna3)] <- 0
Gives:
> anna3
name from to result
1 11 66607 66841 5
2 11 66846 67048 6
3 11 67053 67404 0
4 11 67409 68216 7
5 11 68221 68786 0
6 11 68791 69020 0
7 11 69025 69289 12
8 11 69294 70167 0
9 11 70172 70560 45
If the "from" column is guaranteed to be unique in both anna1 and anna2, AND every row in anna2 has a matching row in anna1 (though not vice versa), a simple solution is
row.index = function(d) which(anna1$from == d)[1]
indices = sapply(anna2$from, row.index)
anna1$result[indices] = anna2$result
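If some rows of anna2 might have no counterpart in anna1, the same idea can be written with match on a combined key, skipping the unmatched rows. A minimal sketch with the data rebuilt from the question; `key1`, `key2`, `idx` and `ok` are illustrative names:

```r
anna1 <- data.frame(
  name   = 11,
  from   = c(66607, 66846, 67053, 67409, 68221, 68791, 69025, 69294, 70172),
  to     = c(66841, 67048, 67404, 68216, 68786, 69020, 69289, 70167, 70560),
  result = 0
)
anna2 <- data.frame(
  name   = 11,
  from   = c(66607, 66846, 67409, 69025, 70172),
  to     = c(66841, 67048, 68216, 69289, 70560),
  result = c(5, 6, 7, 12, 45)
)

# Combine the identifying columns into one key per row
key1 <- paste(anna1$name, anna1$from, anna1$to)
key2 <- paste(anna2$name, anna2$from, anna2$to)

idx <- match(key2, key1)  # row of anna1 for each row of anna2, NA if none
ok  <- !is.na(idx)        # ignore anna2 rows with no counterpart

anna1$result[idx[ok]] <- anna2$result[ok]
print(anna1)
```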
Another approach
require(plyr)
anna <- rbind(anna1, anna2)
ddply(anna, .(name, from, to), summarize, result = sum(result))
EDIT. If the data frames are large, and speed is an issue, think of using data.table
require(data.table)
data.table(anna)[,list(result = sum(result)),'name, from, to']
You can use merge, but you have to explicitly specify what should be done with the two result columns.
d <- merge(anna1, anna2, by=c("name", "from", "to"), all=TRUE)
d$result <- ifelse(d$result.x == 0 & !is.na( d$result.y ), d$result.y, d$result.x)
d <- d[,c("name", "from", "to", "result")]

Resources