How to interpolate values using a for loop? - R

I am trying to interpolate values from a data frame I have imported from Excel. The table has two columns (Value1 and Value2) that I am trying to interpolate for each unique "Name". The data frame contains 550 rows with 90 unique Names (so each Name has more than one Value1 and more than one Value2). There are also several irrelevant columns in the data frame that I have no use for.
Example of data frame:
Name  Value1  Value2  NotImportantvalue1  NotImportantvalue2
A          1       1  ABC                 ABC
A          2       1  ABC                 ABC
B         40      40  ABC                 ABC
C         30      30  ABC                 ABC
C          1       2  ABC                 ABC
D          2     400  ABC                 ABC
D          3     500  ABC                 ABC
D         40       2  ABC                 ABC
I've been messing around with for loops that cycle through a data frame containing the unique Names, trying to pull the Value1/Value2 rows of my.data where the "Name" column matches the name in the unique data frame, but I'm not getting the results I want.
Where I'm currently at: for each i in the unique-Name data frame, I try to get Value1 and Value2 wherever i matches the "Name" column in my.data, and save the result as a data frame with that name. After that I still have to figure out how to interpolate the values in each data frame.
# Set working directory
setwd("H:\\R-project")
# Set file path
file <- file.path("Data.xlsx")
# Load library
library(XLConnect)
# Read data
my.data <- readWorksheetFromFile(file, sheet = 1, startRow = 1)
# Unique Names
Name <- data.frame(Name = unique(my.data$Name))
for (i in Name$Name) {
  assign(i, data.frame(Value1 = my.data$Value1[my.data$Name == i],
                       Value2 = my.data$Value2[my.data$Name == i]))
}
I'm also not sure if using 90 individual data frames is the way to go, or if I should use something like
name_list <- split(my.data, as.factor(my.data$Name))
and interpolate from the list directly (although I don't know exactly how to do that either; for loops aren't my strong point).
Any guidance or help on how to continue would be greatly appreciated!

As you suggested,
name_list <- split(my.data, my.data$Name)
will give you a list of data frames split by Name.
You can operate on that list using something like the following:
lapply(name_list, function(x) approx(x$Value1, x$Value2))
You will need to provide more details on the desired output if you want a more specific answer.
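A self-contained sketch of that idea, using toy data shaped like the example above (Name B is left out because approx() needs at least two points per group) and interpolating each group's Value2 at the midpoint of its Value1 range via the xout argument, which is an assumption about the desired output:

```r
# Toy data shaped like the question's example (Name B omitted: one point only)
my.data <- data.frame(
  Name   = c("A", "A", "C", "C", "D", "D", "D"),
  Value1 = c(1, 2, 30, 1, 2, 3, 40),
  Value2 = c(1, 1, 30, 2, 400, 500, 2)
)

# One data frame per Name
name_list <- split(my.data, my.data$Name)

# Linearly interpolate Value2 at the midpoint of each group's Value1 range;
# approx() sorts the x values itself, so unsorted groups like C are fine
results <- lapply(name_list, function(x) {
  approx(x$Value1, x$Value2, xout = mean(range(x$Value1)))
})

results$A$y  # interpolated Value2 for group A at Value1 = 1.5
```

Each element of results is a list with x (the point interpolated at) and y (the interpolated value), so it is straightforward to reshape into whatever final form you need.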

Related

How to check for the same IDs in a different data frame and merge the files?

I want to modify longitudinal data based on ID.
I want to check whether the IDs in wave 1 (A) and wave 2 (B) match properly. I also want to combine the data of A and B into one file based on ID.
I tried to merge the files using merge() and to check whether the IDs matched through the sex variable. However, it is difficult to check IDs this way when the waves share no common variable, and it does not directly check each ID.
ID <- c(1012,1102,1033,1204,1555)
sex <- c(1,0,1,0,1)
A <- cbind(ID,sex)
A <- as.data.frame(A)
ID <- c(1006,1102,1001,1033,1010,1234,1506,1999)
sex <- c(1,0,1,1,1,0,0,0)
B <- cbind(ID,sex)
B <- as.data.frame(B)
merge.AB<-merge(A,B,by="ID")
all(merge.AB$sex.x == merge.AB$sex.y)
1. Is there any way to merge the A (wave 1) and B (wave 2) files by ID other than merge()?
Since there are 2 or 3 wave-1 files other than A, it would be nice to be able to combine several files at once.
2. Is there a way to check directly whether two data frames have the same IDs?
I tried combining the files and checking matching IDs with cbind(), but I couldn't compare them because A and B have different numbers of rows.
A loop (if, for, etc.) would be helpful, but it would be nice if there were a way to do it with a package or simpler code.
3. How do I locate the row numbers with mismatched IDs?
I want all the row positions for the example.
e.g.
mismatched ID in A: 1012,1204,1555
mismatched ID in B: 1006,1001,1010,1234,1506,1999
Question 1: you can merge multiple data frames with merge(). First create a list of the data frames you want to merge, then use Reduce():
df_list <- list(df1, df2, ..., dfn)
data <- Reduce(function(x, y) merge(x = x, y = y, by = "ID", all = TRUE), df_list)
Alternatively, using the tidyverse:
library(tidyverse)
df_list %>% reduce(full_join, by = 'ID')
In your example, note that it is not convenient to merge by ID alone when both data frames carry the same variable name with the same information. You could simply use
data <- Reduce(function(x, y) merge(x = x, y = y, all = TRUE), df_list)
to remove redundant information from the merged data frame.
Question 2: check IDs with setdiff() and intersect().
intersect() gives you the common values between two vectors.
setdiff(x, y) gives you the values in x that are not present in y.
intersect(A$ID,B$ID)
[1] 1102 1033
setdiff(A$ID,B$ID)
[1] 1012 1204 1555
setdiff(B$ID,A$ID)
[1] 1006 1001 1010 1234 1506 1999
Question 3: a simple which() combined with an %in% test will give you the positions in the data frame:
which(!(A$ID %in% B$ID))
[1] 1 4 5
which(!(B$ID %in% A$ID))
[1] 1 3 5 6 7 8
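To tie the three answers together on the question's own data, here is a runnable sketch; wave C is hypothetical, added only to show merging more than two files at once:

```r
# Waves A and B from the question
A <- data.frame(ID = c(1012, 1102, 1033, 1204, 1555),
                sex = c(1, 0, 1, 0, 1))
B <- data.frame(ID = c(1006, 1102, 1001, 1033, 1010, 1234, 1506, 1999),
                sex = c(1, 0, 1, 1, 1, 0, 0, 0))
# Hypothetical third wave
C <- data.frame(ID = c(1102, 1033, 1777),
                sex = c(0, 1, 1))

df_list <- list(A, B, C)

# Full outer merge on all shared columns (ID and sex),
# keeping IDs that appear in any wave
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), df_list)

intersect(A$ID, B$ID)     # IDs in both A and B: 1102 1033
which(!(A$ID %in% B$ID))  # rows of A with mismatched IDs: 1 4 5
```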

Separating multiple values within one cell into multiple cells

I am working with patent data, and each cell holds multiple values (International Patent Classification codes) separated by commas, for instance C12N, C12P, A01 within one cell. I would like to analyze how many different 4-digit classifications each patent has. To do so I need to, first, separate the individual values and put each of them into an individual cell, and second, count the unique values (each then placed in a separate column) for each row.
How can I separate the individual values within one cell into multiple cells in Excel or R? Is there an Excel or R function you can suggest?
Here is a reproducible example of how the data looks in R or in Excel.
#Example of how the data looks like
Publication_number<-c(12345,10012,23323,44556,77999)
IPC_class_4_digits<-c("C12N,CF01,C345","C12P,F12N,F039","A014562,F23N", "A01C, A01B, A01F, A01K, A01G", "C10N, C10R, C10Q, C12F")
data_example<-cbind(Publication_number, IPC_class_4_digits)
View(data_example)
The expected output should be a column "counts" containing the number of different 4-digit codes, in this case c(3, 3, 2, 5, 4).
Assuming you have a data frame with two columns, Publication_number and IPC_class_4_digits, you could use cSplit from the splitstackshape package:
library(splitstackshape)
# assuming your data
df <- data.frame(Publication_number, IPC_class_4_digits)
cSplit(df, 'IPC_class_4_digits', ',')
Output (first rows and columns shown):
   Publication_number IPC_class_4_digits_1 IPC_class_4_digits_2 IPC_class_4_digits_3
1:              12345                 C12N                 CF01                 C345
2:              10012                 C12P                 F12N                 F039
3:              23323              A014562                 F23N                 <NA>
You can split the string on the comma and count the number of pieces:
data_example$count <- lengths(strsplit(data_example$IPC_class_4_digits, ','))
data_example
# Publication_number IPC_class_4_digits count
#1 12345 C12N,CF01,C345 3
#2 10012 C12P,F12N,F039 3
#3 23323 A014562,F23N 2
#4 44556 A01C, A01B, A01F, A01K, A01G 5
#5 77999 C10N, C10R, C10Q, C12F 4
Or another option is to use str_count, counting the commas and adding one:
data_example$count <- stringr::str_count(data_example$IPC_class_4_digits, ',') + 1
data
data_example<-data.frame(Publication_number, IPC_class_4_digits)
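If a cell could ever repeat a code, a base-R sketch that trims the stray spaces and counts only unique codes per row (the uniqueness step is an assumption; the question's example has no duplicates):

```r
Publication_number <- c(12345, 10012, 23323, 44556, 77999)
IPC_class_4_digits <- c("C12N,CF01,C345", "C12P,F12N,F039", "A014562,F23N",
                        "A01C, A01B, A01F, A01K, A01G", "C10N, C10R, C10Q, C12F")
data_example <- data.frame(Publication_number, IPC_class_4_digits)

# Split each cell on commas, trim whitespace, keep unique codes, count them
codes <- strsplit(data_example$IPC_class_4_digits, ",")
data_example$count <- lengths(lapply(codes, function(x) unique(trimws(x))))

data_example$count
# [1] 3 3 2 5 4
```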

R: filter dataframe by column name with a string match in a different dataframe column

I have two data frames, shown below. I would like to subset the first data frame to keep only the columns whose names appear in a column of the second data frame, plus the columns that partially match one specific string. The actual data is much longer with more varied names, so I need something that can easily be applied to all of them.
df1:
abc1 abc2 acd1 abd1 acd2 xxx1 xxx2
   1    2    3    4    5    6    7
df2:
sample total
  abc1     5
  abc2     4
desired df3:
abc1 abc2 xxx1 xxx2
   1    2    6    7
Here is what I tried:
keep <- df2$sample
df3 <- df1 %>% select(contains(keep))
which kept all columns that had a partial string match, not a complete string match, and
keep <- df2$sample
df3 <- filter(df1, grepl(keep, colnames(df1)))
which gave me the error "input 1 must be of size 1037 or 1, not 160" (1037 = number of rows in df1, 160 = number of columns).
Additionally, this does not deal with the xxx columns. For that I have tried the following:
cols <- colnames(df1)
keep <- list.append(keep, colnames(df1) %>% select(contains("xxx")))
keep <- list.append(keep, filter(colnames(df1), grepl("xxx", df1)))
keep <- list.append(keep, cols %>% select(contains("xxx")))
keep <- list.append(keep, filter(cols, grepl("xxx", cols)))
all of which resulted in the error
no applicable method for 'x' applied to an object of class "character"
where x is the function used (e.g. filter), and
keep <- list.append(keep, grepl("xxx", cols))
which appended a TRUE/FALSE result for each column name to the list.
I am not attached to this way of doing things, so any and all solutions are appreciated; a list just seemed like the easiest way to me.
As per Martin Gal's comment:
df1 %>% select(contains("xxx"), df2$sample)
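For reference, a self-contained version of that one-liner on the toy data above; note that recent dplyr releases warn when an external character vector is passed straight to select(), preferring an explicit all_of() wrapper:

```r
library(dplyr)

df1 <- data.frame(abc1 = 1, abc2 = 2, acd1 = 3, abd1 = 4,
                  acd2 = 5, xxx1 = 6, xxx2 = 7)
df2 <- data.frame(sample = c("abc1", "abc2"), total = c(5, 4))

# Exact-name matches taken from df2$sample, plus any column containing "xxx"
df3 <- df1 %>% select(all_of(df2$sample), contains("xxx"))

names(df3)
# [1] "abc1" "abc2" "xxx1" "xxx2"
```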

How can you extract string patterns from a column in a data frame and create a new data frame column containing the extracted strings?

I am trying to create a search-key column for my data frame. I would like to extract certain string patterns from one of its columns and use them to fill a new search-key column along the length of the data frame. For example:
x <- c(1:4)
y <- c("BLUE,BALL","BALL,RED","BIG,GREEN,BALL","BALL")
dat <- data.frame(x,y)
which gives:
> dat
  x              y
1 1      BLUE,BALL
2 2       BALL,RED
3 3 BIG,GREEN,BALL
4 4           BALL
Now I would like to make a new search-key column in the data frame based on the occurrences of color patterns in dat$y. I would like to use:
pattern <- "RED|GREEN|BLUE"
For any instance where the pattern is not found in dat$y, I would like to leave the element empty or NA. I would like my results to look something like:
> new.dat
  x              y search.color
1 1      BLUE,BALL         BLUE
2 2       BALL,RED          RED
3 3 BIG,GREEN,BALL        GREEN
4 4           BALL         <NA>
I have used
dat$first <- do.call(rbind, lapply(strsplit(dat[, 2], split = " "), function(x) head(x, 1)))
to create a first-word search key along my data frame, but now I am searching for methods that allow more control over selecting search keys, using grepl or other means. Any help or resources are greatly appreciated.
stringr::str_extract should do what you want easily.
pat <- "(RED|GREEN|BLUE)"
dat <- transform(dat,search.color=stringr::str_extract(y,pat))
## dat
## x y search.color
## 1 1 BLUE,BALL BLUE
## 2 2 BALL,RED RED
## 3 3 BIG,GREEN,BALL GREEN
## 4 4 BALL <NA>
I'm sure there's a base-R gsub() solution as well, but it's not as obvious to me ...
We could also use gregexpr/regmatches from base R
dat$search.color <- sapply(regmatches(dat$y,gregexpr(pat, dat$y)),`[`,1)
dat$search.color
#[1] "BLUE" "RED" "GREEN" NA
data
pat <- "(RED|GREEN|BLUE)"

R functions that output datasets

I am a bit new to R and am trying to use a function to output a data frame. I have several data frames that need deduplication. Each record has an index variable (RecID) and a patient ID (PatID). If a patient is listed multiple times in a data frame, I want to keep the record with the largest RecID.
I want to be able to change this data frame:
PatID RecID
    1     1
    1     2
    2     3
    3     4
    3     5
    4     6
into this data frame:
PatID RecID
    1     2
    2     3
    3     5
    4     6
I can use the following code to successfully deduplicate the dataframe.
df <- df[order(df$PatID, -df$RecID),]
df <- df[ !duplicated(df$PatID), ]
I created a function with this code so I can apply my deduplication scheme across multiple data frames easily.
dedupit <- function(x) {
  x <- x[order(x$PatID, -x$RecID), ]
  x <- x[!duplicated(x$PatID), ]
}
However, when I use dedupit(df), it does not create a new deduplicated df data frame. The function won't output the final data frame or any of the intermediate ones. Is there a way to have functions output data frames?
You need to put return(x) at the end of your function, and assign the result when you call it, e.g. df <- dedupit(df). Inside the function, x is a local copy, so modifying it there does not change the caller's data frame.
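A minimal runnable sketch of the fix, using the question's example data:

```r
df <- data.frame(PatID = c(1, 1, 2, 3, 3, 4),
                 RecID = c(1, 2, 3, 4, 5, 6))

dedupit <- function(x) {
  x <- x[order(x$PatID, -x$RecID), ]  # largest RecID first within each PatID
  x <- x[!duplicated(x$PatID), ]      # keep the first (largest) per patient
  return(x)                           # hand the deduplicated copy back
}

df <- dedupit(df)  # assign the result back; dedupit(df) alone changes nothing
df$RecID
# [1] 2 3 5 6
```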
