
Image-to-Image Translation with FLUX.1: Intuition and Tutorial | by Youness Mansar | Oct, 2024


Generate new images based on existing images using diffusion models.

Original image source: Photo by Sven Mieke on Unsplash / Transformed image: Flux.1 with prompt “An image of a Tiger”

This post guides you through generating new images based on existing ones and textual prompts. This technique, presented in a paper called SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, is applied here to FLUX.1.

First, we’ll briefly explain how latent diffusion models work. Then, we’ll see how SDEdit modifies the backward diffusion process to edit images based on text prompts. Finally, we’ll provide the code to run the entire pipeline.

Latent diffusion performs the diffusion process in a lower-dimensional latent space. Let’s define latent space:

Supply: https://en.wikipedia.org/wiki/Variational_autoencoder

A variational autoencoder (VAE) projects the image from pixel space (the RGB-height-width representation humans understand) to a smaller latent space. This compression retains enough information to reconstruct the image later. The diffusion process operates in this latent space because it is computationally cheaper and less sensitive to irrelevant pixel-space details.
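
To make this concrete, here is a minimal sketch of that round trip, assuming a standalone diffusers VAE checkpoint (stabilityai/sd-vae-ft-mse is used purely for illustration, and input.jpg is a hypothetical local file; FLUX.1 ships its own VAE inside the pipeline):

import numpy as np
import torch
from diffusers import AutoencoderKL
from PIL import Image

# Any diffusers VAE works for this illustration; FLUX.1 bundles its own.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

img = Image.open("input.jpg").convert("RGB").resize((512, 512))  # hypothetical file
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0  # scale pixels to [-1, 1]
x = x.permute(2, 0, 1).unsqueeze(0)                        # (1, 3, 512, 512)

with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()  # (1, 4, 64, 64): ~48x fewer values
    recon = vae.decode(latents).sample            # back to (1, 3, 512, 512)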

Now, let’s explain latent diffusion:

Supply: https://en.wikipedia.org/wiki/Diffusion_model

The diffusion process has two parts:

  • Forward diffusion: a scheduled, non-learned process that transforms a natural image into pure noise over multiple steps.
  • Backward diffusion: a learned process that reconstructs a natural-looking image from pure noise.

Noise is added to the latent space following a specific schedule, progressing from weak to strong noise during forward diffusion. This multi-step approach simplifies the network’s task compared to one-shot generation methods like GANs. The backward process is learned through likelihood maximization, which is easier to optimize than adversarial losses.
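
As a rough illustration of that schedule, here is a sketch of the closed-form DDPM-style forward step (FLUX.1 actually uses a flow-matching formulation, so treat this as the textbook variant rather than FLUX’s exact math):

import torch

# Non-learned schedule: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise added per step, weak to strong
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal retained at step t

def forward_diffuse(x0, t):
    """Jump straight to step t of the forward process."""
    eps = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * eps

x0 = torch.randn(1, 4, 64, 64)                   # a stand-in latent
slightly_noisy = forward_diffuse(x0, 50)         # early step: mostly signal
almost_noise = forward_diffuse(x0, 950)          # late step: mostly noise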

Text Conditioning

Supply: https://github.com/CompVis/latent-diffusion

Generation can also be conditioned on extra information like text, which is the prompt that you might give to a Stable Diffusion or a Flux.1 model. This text is included as a “hint” for the diffusion model when it learns how to do the backward process. The text is encoded using something like a CLIP or T5 model and fed to the UNet or Transformer to guide it towards the original image that was perturbed by noise.
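
As a sketch of what that encoding step looks like, here is the CLIP text encoder commonly paired with diffusion models (FLUX.1 combines a CLIP encoder with a larger T5 encoder, but the mechanics are the same):

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer("An image of a Tiger", padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    # One embedding per token; the denoiser attends to these at every step.
    embeddings = text_encoder(tokens.input_ids).last_hidden_state  # (1, 77, 768)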

The idea behind SDEdit is simple: in the backward process, instead of starting from full random noise like the “Step 1” of the image above, it starts from the input image plus scaled random noise before running the regular backward diffusion process. It goes as follows (see the sketch after this list):

  • Load the input image and preprocess it for the VAE.
  • Run it through the VAE and sample one output (the VAE returns a distribution, so we need to sample to get one instance of it).
  • Pick a starting step t_i of the backward diffusion process.
  • Sample some noise scaled to the level of t_i and add it to the latent image representation.
  • Start the backward diffusion process from t_i using the noisy latent image and the prompt.
  • Project the result back to pixel space using the VAE.
  • Voilà!
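
Conceptually, the first four steps look like the sketch below, reusing the alphas_bar schedule from the earlier snippet (the diffusers pipeline further down does all of this internally via its strength argument):

import torch

def sdedit_start_latents(vae, x, alphas_bar, strength=0.9):
    """x: preprocessed image tensor in [-1, 1]; returns the noisy latent to denoise from."""
    with torch.no_grad():
        latents = vae.encode(x).latent_dist.sample()   # encode + sample one instance
    t_i = int(alphas_bar.shape[0] * strength) - 1      # pick the starting step
    eps = torch.randn_like(latents)                    # noise scaled to the level of t_i
    return alphas_bar[t_i].sqrt() * latents + (1.0 - alphas_bar[t_i]).sqrt() * eps
    # ...then run the backward process from t_i and vae.decode the result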

Here is how to run this workflow using diffusers:

First, install the dependencies:

pip install git+https://github.com/huggingface/diffusers.git optimum-quanto

For now, you need to install diffusers from source, as this feature is not yet available on PyPI.

Next, load the FluxImg2Img pipeline:

import os
import io

import requests
import torch
from diffusers import FluxImg2ImgPipeline
from optimum.quanto import qint8, qint4, quantize, freeze
from PIL import Image

MODEL_PATH = os.getenv("MODEL_PATH", "black-forest-labs/FLUX.1-dev")

# Load the pipeline in bfloat16, then quantize the heavy components:
# both text encoders to 4-bit and the transformer to 8-bit.
pipeline = FluxImg2ImgPipeline.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)

quantize(pipeline.text_encoder, weights=qint4, exclude="proj_out")
freeze(pipeline.text_encoder)

quantize(pipeline.text_encoder_2, weights=qint4, exclude="proj_out")
freeze(pipeline.text_encoder_2)

quantize(pipeline.transformer, weights=qint8, exclude="proj_out")
freeze(pipeline.transformer)

pipeline = pipeline.to("cuda")

# Fixed seed for reproducible generations
generator = torch.Generator(device="cuda").manual_seed(100)

This code loads the pipeline and quantizes some components of it so that it fits on the L4 GPU available on Colab.

Now, let’s define a utility function to load images at the correct size without distortion:

def resize_image_center_crop(image_path_or_url, target_width, target_height):
    """
    Resizes an image while maintaining aspect ratio using center cropping.
    Handles both local file paths and URLs.

    Args:
        image_path_or_url: Path to the image file or URL.
        target_width: Desired width of the output image.
        target_height: Desired height of the output image.

    Returns:
        A PIL Image object with the resized image, or None if there is an error.
    """
    try:
        if image_path_or_url.startswith(('http://', 'https://')):  # Check if it's a URL
            response = requests.get(image_path_or_url, stream=True)
            response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
            img = Image.open(io.BytesIO(response.content))
        else:  # Assume it's a local file path
            img = Image.open(image_path_or_url)

        img_width, img_height = img.size

        # Calculate aspect ratios
        aspect_ratio_img = img_width / img_height
        aspect_ratio_target = target_width / target_height

        # Determine cropping box
        if aspect_ratio_img > aspect_ratio_target:  # Image is wider than target
            new_width = int(img_height * aspect_ratio_target)
            left = (img_width - new_width) // 2
            right = left + new_width
            top = 0
            bottom = img_height
        else:  # Image is taller than or equal to target
            new_height = int(img_width / aspect_ratio_target)
            left = 0
            right = img_width
            top = (img_height - new_height) // 2
            bottom = top + new_height

        # Crop the image
        cropped_img = img.crop((left, top, right, bottom))

        # Resize to target dimensions
        resized_img = cropped_img.resize((target_width, target_height), Image.LANCZOS)

        return resized_img

    except (FileNotFoundError, requests.exceptions.RequestException, IOError) as e:
        print(f"Error: Could not open or process image from '{image_path_or_url}'. Error: {e}")
        return None
    except Exception as e:  # Catch other potential exceptions during image processing
        print(f"An unexpected error occurred: {e}")
        return None

Finally, let’s load the image and run the pipeline:

url = "https://pictures.unsplash.com/photo-1609665558965-8e4c789cd7c5?ixlib=rb-4.0.3&q=85&fm=jpg&crop=entropy&cs=srgb&dl=sven-mieke-G-8B32scqMc-unsplash.jpg"
picture = resize_image_center_crop(image_path_or_url=url, target_width=1024, target_height=1024)

immediate = "An image of a Tiger"
image2 = pipeline(immediate, picture=picture, guidance_scale=3.5, generator=generator, peak=1024, width=1024, num_inference_steps=28, energy=0.9).pictures[0]

This transforms the following image:

Photo by Sven Mieke on Unsplash

To this one:

Generated with the prompt: A cat laying on a bright red carpet

You can see that the cat has a similar pose and shape to the original cat but with a different color carpet. This means that the model followed the same pattern as the original image while also taking some liberties to make it fit the text prompt better.

There are two important parameters here:

  • num_inference_steps: the number of denoising steps during backward diffusion; a higher number means better quality but a longer generation time.
  • strength: controls how much noise is added, i.e., how far back in the diffusion process to start. A smaller value means smaller changes, and a higher value means more significant changes (the sketch below shows how it maps to a starting step).
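
To see how strength picks the starting step, here is (roughly) the bookkeeping the diffusers img2img pipelines do internally; the variable names are illustrative:

num_inference_steps = 28
strength = 0.9

# Portion of the schedule that will actually be run:
init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
t_start = max(num_inference_steps - init_timestep, 0)
# strength=0.9 -> 25 of 28 denoising steps, starting near pure noise
# strength=0.3 -> only 8 steps, staying close to the input image
print(f"Running {num_inference_steps - t_start} steps, skipping the first {t_start}")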

Now you know how image-to-image latent diffusion works and how to run it in Python. In my tests, the results can still be hit-and-miss with this approach; I usually need to change the number of steps, the strength, and the prompt to get it to adhere to the prompt better. The next step would be to look into an approach that has better prompt adherence while also keeping the key elements of the input image.

Full code: https://colab.research.google.com/drive/1GJ7gYjvp6LbmYwqcbu-ftsA6YHs8BnvO
