R: manipulate list of dataframes based on condition - r

I find this question difficult; it is well beyond my level, and I would like some help so that I can learn how to do this myself in the future. If I'm not providing enough information, or the information is unclear, please let me know.
I have a list of dataframes:
d1<-data.frame( Data0 = c("N,R,15,P,D", "_KEY_VALUE_1", -1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25),
Data1 = c("N,15,C,D", "Garden",0.9759,0.7121,0.7376,0.7647,0.7927,0.8209,0.8487,0.8759,0.9021,0.9274,0.9518,
1,1.0249,1.0514,1.0805,1.1132,1.1508,1.1946,1.2462,1.3071,1.3793,1.4649,1.5661,1.6854,1.8254,1.9887))
d2<-data.frame(
Data0=c("N,R,2,I,D","no_flowers",-2 , 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 ,10 ,11) ,
Data1=c("N,15,C,D","Garden",0.8891 ,0.8891,0.9051,1,0.8891,0.8891,0.7907,0.8891,0.9929,0.8891,0.8891,0.8891,0.8891)
)
d3<-data.frame(Data0=c("A,X,15,P,D","_KEY_TEXT_1","Y","N","U"),
Data1=c("N,15,C,D","Garden",1.0834,1,1))
d4<-data.frame(
Data0=c("A,X,15,P,D","_KEY_TEXT_1","Y","Y","Y","Y","Y","Y","N","N","N","N","N","N"),
Data1=c("N,R,3,I,D","house_age",16,18,19,20,21,50,16,18,19,20,21,50),
Data2=c("N,15,C,D","Garden",2.2291,2.0743,1.9369,1.8148,1.7064,1.6102,2.2291,2.0743,1.9369,1.8148,1.7064,1.6102)
)
dfl<-list(d1,d2,d3,d4)
names(dfl)<-c("no_animals","no_flowers","radiation","summer_x_house_age")
If you look at the first value of the first column in each dataframe, the second letter (after the first comma) is either R or X. R stands for Ranged and X stands for not Ranged. If the letter is "R" (Ranged), I would like to split the column into two columns, i.e. I would like the result for the d1 dataframe to look like this:
For the d4 dataframe, an interaction between "summer" (Y/N) and "house age", we see that only the second column (house age) is ranged, so I would like to do the same as for d1, but for both summer=Y and summer=N.
A little bit of background on the data frames, if it makes things easier to understand:
These are the results of a GLM I fitted outside of R, which I wish to import into R. The last column of each dataframe always holds the beta values of the regression, and the column(s) before it are the variables, which are sometimes categorical (X) and sometimes continuous (R). When they are continuous/ranged, I must manipulate the column to get "from" and "to", because I want to use this list to calculate probabilities for some data where I have values of the regressors used in my GLM. The topmost number means "from (not including) negative infinity, to (and including) the topmost number", the second number means "from (not including) the topmost number, to (and including) the second number", and so on.

Think I've got it.
Define a new function which looks for the key letter (R or X) and returns either a new data frame (if R) or the same data frame (if X).
Rcheck <- function(df){
  # Isolate the letter being tested for R or X
  key_letter <- substr(as.character(df[1,1]),3,3)
  if( key_letter == "R"){ # Proceed if letter is R
    # Assign new dataframe
    df_new <- df
    # Add new column.
    df_new[,'Data0_'] <- as.character(df_new[,'Data0'])
    # Shift down and add -9999 value
    rows <- nrow(df_new)
    df_new[,'Data0_'][4:rows] <- as.character(df_new[,'Data0'][3:(rows-1)])
    df_new[,'Data0_'][3] <- "-9999"
    # Take new column from the end and put it beside Data0
    column1_name <- colnames(df_new)[1]
    new_column_name <- colnames(df_new)[ncol(df_new)]
    other_column_names <- colnames(df_new)[2:(ncol(df_new)-1)]
    df_new <- df_new[,c(column1_name, new_column_name, other_column_names)]
    df_new
  } else{ # If letter is not R
    df
  }
}
Then apply this function to your list of data frames using lapply.
new_list <- lapply(dfl, Rcheck)
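Note that Rcheck only inspects the header of the first column, so a dataframe like d4, where the ranged variable sits in the second column, is returned unchanged. As a starting point for handling that case, here is a minimal sketch of a helper that reports which columns are ranged. The name ranged_columns is hypothetical, and it assumes every header is a comma-separated string whose second field is "R" for ranged variables:

```r
# Hypothetical helper: for each column, TRUE if the second comma-separated
# field of its header (row 1) is "R".
ranged_columns <- function(df) {
  # Row 1 holds the headers; as.character() also copes with factor columns
  headers <- vapply(df[1, , drop = FALSE], as.character, character(1))
  second_field <- vapply(strsplit(headers, ",", fixed = TRUE),
                         function(parts) parts[2], character(1))
  second_field == "R"
}

# Shortened version of d4 from the question
d4 <- data.frame(
  Data0 = c("A,X,15,P,D", "_KEY_TEXT_1", "Y", "N"),
  Data1 = c("N,R,3,I,D", "house_age", "16", "16"),
  Data2 = c("N,15,C,D", "Garden", "2.2291", "2.2291")
)
is_ranged <- ranged_columns(d4)
is_ranged
```

Here only Data1 is flagged. The "from" column for d4 would still have to be built separately within each summer group (Y and N), which this sketch does not attempt.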

Related

Dividing one dataframe into many with names in R

I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be of a data type that the split function can convert to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
A = sample(letters, size = 70000000, replace = TRUE),
B = rpois(70000000, lambda = 1)
)
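The modulo grouping above deals rows out round-robin. Since the question mentions that splitting sequentially by rows would be natural, here is a small sketch of that variant on a toy data frame. The names n_parts, chunk_size and parts are illustrative, and the number of parts can come out smaller than requested when the rows do not divide evenly:

```r
# Toy data standing in for the large data frame
dat <- data.frame(A = letters[1:10], B = 1:10)

n_parts <- 3
chunk_size <- ceiling(nrow(dat) / n_parts)

# Rows 1..chunk_size go to group 1, the next chunk_size rows to group 2, ...
group <- ceiling(seq_len(nrow(dat)) / chunk_size)

parts <- split(dat, group)
names(parts) <- paste0("newdf_", seq_along(parts))

# Work on parts$newdf_1, parts$newdf_2, ... and recombine when done
recombined <- do.call(rbind, parts)
```

Keeping the pieces in a named list, rather than as separate variables, means they can be processed with lapply() and recombined in one call.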
Here's a tidyverse based solution. Try using read_csv_chunked().
library(tidyverse)
# practice data
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
  DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
  chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
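One possible shape for such a wrapper is sketched here. The function name read_matching_rows and its arguments are made up for illustration, and it assumes the file has a column called string, as in the practice data:

```r
library(readr)
library(dplyr)

# Hypothetical wrapper: read `path` in chunks and keep only the rows whose
# `string` column equals `wanted`.
read_matching_rows <- function(path, wanted, chunk_size = 1000) {
  read_csv_chunked(
    path,
    DataFrameCallback$new(function(x, pos) dplyr::filter(x, string == wanted)),
    chunk_size = chunk_size
  )
}

# Tiny demonstration file
demo_path <- tempfile(fileext = ".csv")
write_csv(data.frame(string = c("a", "b", "a"), value = c(1, 2, 3)), demo_path)

partial_data <- read_matching_rows(demo_path, "a", chunk_size = 2)
```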
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?

Separating or grouping Values of a column into different categories in R

Good morning everyone.
I have a problem that I have not been able to solve for quite some time now (please see the linked screenshot of my data set): https://i.stack.imgur.com/g2eTM.jpg
I have a column of data (status) containing two values (1 and 2). These are dummies representing two categories (or statuses) of dependent variables (say Pp and Pt) that I need for a regression. Their actual values are contained in the last column, Pp.Pt (Pp.Pt is just a name, nothing more).
I need to run two separate regressions, one using Pp and one using Pt (that is, using their respective values in the Pp.Pt column; each value in the last column has either status 1 or status 2). My question is: how do I separate or group them into these two categories, 1 = Pp and 2 = Pt, so that I can clearly identify them?
Thank you very much for your kind help.
Best
Ludovic
Split-Apply-Combine method :
# Using the mtcars dataset as an example:
df <- mtcars
# Allocate some memory for a list storing the split data.frame:
# df_list => empty list with the number of elements of the unique
# values of the cyl vector
df_list <- vector("list", length(unique(df$cyl)))
# Split the data.frame by the cyl vector:
df_list <- split(df, df$cyl)
# Apply the regression model, return the summary data:
lapply(df_list, function(x){
  summary(lm(mpg ~ hp, data = x))
})
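Applied to the layout described in the question, the same idea might look like the sketch below. The data are simulated, the regressor x is a placeholder (the question does not name its regressors), and only the column names status and Pp.Pt come from the question:

```r
# Simulated stand-in for the data in the screenshot
set.seed(1)
dat <- data.frame(status = rep(1:2, each = 10), x = rnorm(20))
dat$Pp.Pt <- 2 * dat$x + rnorm(20)

# Label the two statuses so the groups are easy to identify
dat$category <- factor(dat$status, levels = c(1, 2), labels = c("Pp", "Pt"))

# One data frame per category, then one regression per data frame
by_category <- split(dat, dat$category)
models <- lapply(by_category, function(d) lm(Pp.Pt ~ x, data = d))
```

models$Pp and models$Pt then hold the two fitted models, and summary(models$Pp) shows the first one.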
This approach can fix your issue:
yourdata %>%
  mutate(classofyourcolumn = ifelse(columntosplit < quantile(columntosplit, 0.5), 1, 0))

Getting a subset in R

I have a dataframe with 14 columns, and I want to take a subset with the same columns but keeping only the rows whose ID repeats (for example, if ID = 2 occurs more than once, I want to keep those rows).
To begin, I applied table() to my dataframe to see the frequencies of the ID:
head(sort(table(call.dat$IMSI), decreasing = TRUE), 100)
In my case, 20801170106338 is repeated twice, so I want to see the two observations for this ID.
Afterward, I tried x <- subset(call.dat, IMSI == "20801170106338") and hsb6 <- call.dat[call.dat$IMSI == "20801170106338", ], but the results are wrong (x comes back with 0 observations of 14 variables, and hsb6 contains only NA).
Can you help me? Thanks.
PS: IMSI is a numeric value.
And x <- subset(call.dat, Handset.Manufacturer == "LG") is another example which works perfectly...
You can use duplicated(), a function that returns a logical vector which is TRUE wherever a record repeats an earlier one.
isDuplicated <- duplicated(call.dat$IMSI)
Then, you can extract the rows flagged as duplicates.
call.dat.duplicated <- call.dat[isDuplicated, ]
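One caveat: duplicated() marks only the second and later occurrences of a value, so the first row for each repeated IMSI is not selected. To see every observation for a repeated ID, the check can be combined with a scan from the other end. A small sketch on made-up data (only the column names come from the question):

```r
# Toy data: the first and last rows share an IMSI
call.dat <- data.frame(
  IMSI = c(20801170106338, 1, 2, 20801170106338),
  Handset.Manufacturer = c("LG", "Nokia", "Apple", "LG")
)

# TRUE for every row whose IMSI occurs more than once, first occurrence included
is_repeated <- duplicated(call.dat$IMSI) |
  duplicated(call.dat$IMSI, fromLast = TRUE)

repeated_rows <- call.dat[is_repeated, ]
```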

variable dataframe in R

Say I have loaded a csv file into R with two columns (column A and column B, say) with real-valued entries. Call the dataframe df. Is there a way of speeding up the following code:
dfm <- df[floor(A) = x & floor(B) = y,]
x <- 2
y <- 2
dfm
I am hoping there will be something akin to a function, e.g.
dfm <- function(x,y) {df[floor(A) = x & floor(B) = y,]}
so that I can type
Any help much appreciated.
The way that's written right now won't work for a few reasons:
You need to assign values to x and y before you assign dfm. In other words, the lines x <- 2 and y <- 2 must come before the dfm <- ... line.
R doesn't know what A and B are, even if you put them inside the brackets of the dataframe that contains them. You need to write df$A and df$B.
= is the assignment operator, but you're looking for the logical operator ==. Right now your code is saying "Assign the value x to floor(A)" (which doesn't really make sense). You want to tell it "Only choose rows where floor(A) equals x", or floor(A)==x.
So what you want is:
dfm.create <- function(x,y) {df[floor(df$A)==x & floor(df$B)==y,]}
dfm <- dfm.create(2,2)
Note that if you want the dataframe to be called dfm, you don't want to name the function dfm, or you will have to erase the function to make the dataframe.
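A variant that takes the dataframe as an argument, instead of relying on a global df, can be sketched as follows. The function name filter_by_floor and the sample values are made up for illustration:

```r
# Keep the rows where floor(A) equals x and floor(B) equals y
filter_by_floor <- function(data, x, y) {
  data[floor(data$A) == x & floor(data$B) == y, ]
}

# Sample data: rows 1 and 4 have floor(A) == 2 and floor(B) == 2
df <- data.frame(A = c(2.1, 2.9, 3.5, 2.4),
                 B = c(2.7, 1.2, 2.0, 2.0))

dfm <- filter_by_floor(df, 2, 2)
```

Passing the data explicitly makes the function reusable on other dataframes that have columns A and B.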

Remove rows of a dataframe that match a factor level (and then plot the data excluding that factor level)

I have a data frame with 251 observations and 45 variables. There are 6 observations in the middle of the data frame that I'd like to exclude from my analyses. All 6 belong to one level of a factor. It is easy to generate a new data frame that, when printed, appears to exclude the 6 observations. When I use the new data frame to plot variables by the factor in question, however, the supposedly excluded level is still included in the plot (sans observations). Using str() confirms that the level is still present in some form. Also, the index for the new data frame skips 6 values where the observations formerly resided.
How can I create a new data frame that excludes the 6 observations and does not continue to recognize the excluded factor level when plotting? Can the new data frame be made to "re-index", so that the new index does not skip values formerly assigned to the excluded factor level?
I've provided an example with made up data:
# ---------------------------------------------
# data
char <- c( rep("anc", 4), rep("nam", 3), rep("oom", 5), rep("apt", 3) )
a <- 1:15 / pi
b <- seq(1, 8, .5)
d <- rep(c(3, 8, 5), 5)
dat <- data.frame(char, a, b, d)
dat
# two ways to remove rows that contain a string
datNew1 <- dat[-which(dat$char == "nam"), ]
datNew1
datNew2 <- dat[grep("nam", dat[ ,"char"], invert=TRUE), ]
datNew2
# plots still contain the factor level that was excluded
boxplot(datNew1$a ~ datNew1$char)
boxplot(datNew2$a ~ datNew2$char)
# str confirms that it's still there
str(datNew1)
str(datNew2)
# ---------------------------------------------
You can use the drop.levels() function from the gdata package to reduce the factor levels down to the actually used ones -- apply it on your column after you created the new data.frame.
Also try a search for r and drop.levels here (but you need to make the search term [r] drop.levels which I can't here as it interferes with the formatting logic).
Starting with R version 2.12.0, there is a function droplevels, which can be applied either to factor columns or to the entire dataframe. When applied to the dataframe, it will remove zero-count levels from all factor columns. So your example will become simply:
# two ways to remove rows that contain a string
datNew1 <- droplevels( dat[-which(dat$char == "nam"), ] )
datNew2 <- droplevels( dat[grep("nam", dat[ ,"char"], invert=TRUE), ] )
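The question also asks about re-indexing, which droplevels() does not do. A small sketch covering both steps on a shortened version of the example data follows. The column is created with factor() explicitly, because from R 4.0.0 onwards data.frame() no longer converts strings to factors by default:

```r
# Shortened example data with an explicit factor column
dat <- data.frame(char = factor(c("anc", "nam", "oom")), a = 1:3)

# Remove the unwanted rows and drop the now-empty factor level
datNew <- droplevels(dat[dat$char != "nam", ])

# Reset the row names so the index runs 1, 2, ... without gaps
rownames(datNew) <- NULL

levels(datNew$char)
```

After this, a plot such as boxplot(a ~ char, data = datNew) no longer shows an empty "nam" group.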
I have pasted something from my code. I have an enclosure experiment in a lake, with measurements from the enclosures and from the lake, but mostly I don't want to deal with the lake.
My variable is called "t.level" and its levels are control, low, medium, high and lake.
This code makes it possible to use nolk$ or data=nolk to get the data without the "lake" level:
nolk <- subset(mylakedata, t.level == "control" |
                 t.level == "low" |
                 t.level == "medium" |
                 t.level == "high")
# Drop unused levels from every factor column
nolk[] <- lapply(nolk, function(t.level) if (is.factor(t.level))
  t.level[drop = TRUE]
  else t.level)
